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Loop  computations  are  the  basic  constructs  which  consume  most 
of  the  computation  time  in  ordinary  programs.   For  a  multiprocessing 
(parallel  or  pipelined)  system,  the  overall  speedup  which  might  be  other- 
wise obtained  will  be  seriously  degraded  if  these  computations  are  carried 
out  in  the  sequential  manner  as  they  are  presented. 

In  this  paper,  we  present  algorithms  to  expose  vector  operations 
and  recurrence  relations  which  constitute  a  loop  computation  and,  in  turn, 
to  transform  recurrence  relations  into  vector  operations.   For  the  latter 
case,  time  and  processor  bounds  for  different  classes  of  problems  and 
practical  situations  will  be  discussed.   Bounds  on  speedups  of  different 
types  of  loop  computations  will  also  be  given,  so  that  machine  designers  or 
users  could  know  how  well  they  were  doing  in  some  absolute  sense. 
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1 .   INTRODUCTION 

The  primary  objective  of  this  research  is  to  develop  fast 
and  efficient  algorithms  to  evaluate  programs  which  are  presented  in 
serial  form.   Specifically,  the  recurrence  relations  in  loop  computa- 
tions which  are  encountered  quite  often  in  ordinary  programs  such  as 
Fortran  DO  loops  and  Algol  FOR  loops  will  be  the  center  of  our  concern. 

During  the  last  decade,  as  the  cost  and  size  of  hardware 
circuits  became  remarkably  cheaper  and  smaller,  and  as  the  speed  of 
electronic  devices  approaches  its  physical  limits ,  multiprocessing 
systems  were  introduced  in  an  attempt  to  speed  up  machines  by  allowing 
concurrent  processing.  As  a  result,  various  types  of  multiprocessing 
systems  have  emerged.   These  include  ILLIAC  IV,  CDC  STAR,  TI  ASC,  C.mmp 
and  many  others,  each  enjoying  some  success  in  the  computations  tailored 
to  its  own  structure,  and  hence  generally  intended  for  a  special  group 
of  users. 

However,  we  believe  that  the  future  trend  of  multiprocessing 
system  designs  will  and  should  make  the  high-speed  computational  power 
of  such  machines  available  to  ordinary  users  [1],   Two  basic  design  and 
application  principles  which  help  to  attain  the  full  effectiveness  of 
multioperational  machines  are: 

1)  Extracting  parallelism  from  ordinary  programs  as  much  as 
possible.   Obviously,  a  plentiful  supply  of  possible  concurrent  operations 
will  exploit  the  highest  speed  of  any  given  or  proposed  system.   There 
may  be  two  approaches : 


a)  Devising  new  parallel  algorithms  and  providing 
existing  languages  with  adequate  parallel  processing  features. 

b)  Automatic  detection  of  parallelism  at  the  compiler 
level. 

2)   Minimizing  data  accessing  and  alignment  conflicts  in  the 
system  in  order  to  achieve  high-speed  processing.   This  may  be  achieved 
by  properly  designed  memory  systems,  storage  schemes  and  data  alignment 
networks  which  combine  to  allow  conflict-free  accessing  and  processing 
of  various  slices  of  data  required  in  most  ordinary  programs  [2],  [3],  [4]. 

Loop  computations  are  the  basic  constructs  which  consume  most 
of  the  computation  time  in  ordinary  programs.   This  is  true  intuitively 
as  well  as  in  actual  experimental  measurements  [5],  [6],  [7].   For  a 
multiprocessing  system,  the  overall  speedup  which  might  be  otherwise 
obtained  will  be  seriously  degraded  if  these  computations  are  carried  out 
in  the  sequential  manner  as  they  are  presented.   Therefore,  according  to 
the  above  principles,  they  are  very  good  and  important  candidates 
deserving  careful  theoretical  and  practical  studies  if  future  machines 
of  this  class  are  to  be  in  general  use  at  all. 

Most  previous  results  pertaining  to  this  problem  treat  indi- 
vidually each  principle  stated  before.   These  will  be  mentioned  in  later 
chapters  with  their  related  topics.   In  this  thesis,  we  would  like  to 
look  at  the  problem  from  a  unified  point  of  view.   Briefly  put,  we  ask 
ourselves  if  there  is  any  fast  algorithm  (or  algorithms)  which  we  can 
use  for  evaluation  of  recurrence  relations  such  that: 

1)   They  may  be  written  in  numerical  subroutines  by  extending 


current  languages  with  some  useful  features, 

2)  Upon  recognition  in  program  loops,  it  should  be  easy  for 
a  compiler  to  transfer  these  iterative  constructs  into  the  equivalent 
but  fast  evaluations  proposed  by  these  algorithms, 

3)  With  care,  it  is  possible  to  use  or  design  a  well-organized 
system  to  achieve  a  desirable  speed. 

All  the  algorithms  developed  herein  have  some  implicit  bearing 
on  certain  (although  not  all)  solutions  to  these  three  questions.   In 
short,  we  try  to  devise  algorithms  for  easy  implementations  through 
user  subroutines  or  compiler  algorithms,  and  the  assumed  general  machine 
structures  and  data  routing  patterns  underlying  these  algorithms  should 
enhance  their  maximum  speeds. 

In  Chapter  2,  we  discuss  the  algorithms  to  evaluate  linear 
recurrence  relations  with  arbitrary  coefficients  for  both  general  and 
m-th  order  problems.   Chapter  3  emphasizes  the  cases  of  constant  co- 
efficients which  occur  frequently  in  general  numerical  programs.   All 
algorithms  as  they  stand  are  for  the  fastest  computations  and  anticipate 
future  machines  with  as  many  processors  as  the  number  of  IC  packages 
used  in  current  machines.   Also,  some  easy  and  general  modification 
procedures  are  introduced  to  maintain  substantial  speedups  while  using 
processors  more  efficiently.   Some  representative  experimental  results 
to  this  effect  will  be  included.   These  types  of  information  and 
practicalities  should  be  able  to  lead  us  to  good  machine  design  and 
application  procedures  in  future  parallel  machines.   On  the  other  hand, 
for  realistic  designs  and  uses  at  present,  Chapter  4  provides  some 
methods  to  evaluate  most  practical  m-th  order  linear  recurrences  where  m 


is  very  small  and  under  the  assumption  that  usually  the  number  of 
processors  is  far  less  than  the  problem  size.   Extending  the  results 
from  these  three  chapters,  we  then  can  treat  general  jump-free  loop 
computations  in  a  uniform  way  either  theoretically  or  practically  in 
Chapter  5.   Bounds  on  speedups  of  different  types  of  loop  structures 
will  be  given  so  that  designers  or  users  could  know  how  well  they  were 
doing  in  some  absolute  sense.   Although  this  work  provides  some  insight 
to  these  serious  problems,  no  claim  can  be  made  about  the  perfection  of 
the  results  as  spelled  out  in  the  open  questions  listed  in  Chapter  6. 

For  notational  simplicity,  we  will  assume  all  logarithmic 
functions  take  their  ceiling  values  throughout  this  thesis. 


2..      TIME  AND  PARALLEL  PROCESSOR  BOUNDS  FOR  LINEAR 
RECURRENCE  SYSTEMS  WITH  ARBITRARY  COEFFICIENTS 

2.1  Introduction 

One  of  the  most  difficult  aspects  of  speeding  up  ordinary  programs 
for  parallel  or  pipeline  computation  is  the  treatment  of  recurrence  relations 

[7],  [8].    Although  they  occur  with  rather  low  overall  frequency,  an 
annoyingly  large  fraction  of  the  programs  one  sees  contain  at  least  one 
recurrence  relation.   Most  of  these  are  linear  recurrence  relations. 

Suppose  some  multidimensional  computation  space  contains  n  points 

over  which  a  general  linear  recurrence  relation  is  defined.  We  will  show 

o 
that  this  computation  can  be  speeded  up  by  a  factor  of  ©((n/loggn)  )  using 

3 
0(n  )  processors.  More  generally  stated,  we  consider  the  parallel  solution 

of  any  linear  system  of  the  form  x  ■  c+Ax  where  A  is  a  strictly  lower  tri- 
angular matrix  and  c  is  a  constant  column  vector.  While  a  serial  machine  can 

2 
solve  this  system  straightforwardly  in  0(n  )  steps,  our  algorithm  finds  the 

2  3        2 

solution  in  0((log_n)  )  time  steps  using  at  most  n  /68  +  0(n  )  processors. 

Further,  all  processors  perform  the  same  operations  at  the  same  time  (SIMD 
operation).   These  matters  are  discussed  in  Section  2. 

In  Section  3  we  consider  the  restricted  problem  of  solving  an  m-th 
order  linear  recurrence  relation.  We  show  that  given  any  linear  system  of  the 
form  x=c+Ax  where  A  is  a  strictly  lower  triangular  matrix  of  bandwidth  m,  its 


solution  can  be  speeded  up  by  a  factor  of  O(mn/log  mlog  n)  using  at  most 

p 
0(m  n)  processors.   Again,  this  result  is  achieved  with  uniform  operations 

in  all  processors  (SIMD  operation) . 

These  results  are  far  from  ideal  in  several  respects.   First  we 

observe  that  in  the  general  case  a  parallel  computation  time  of  O(logpn) 

could  be  hoped  for.   This  follows  from  a  simple  fan-in  argument  using  the 
0(n  )  data  elements  in  the  A  matrix  and  c  vector.   Similarly,  in  the  m-th 

order  case  we  could  hope  for  a  parallel  computation  time  of  O(log  mn). 
Note  that  we  approximate  this  when  n  »  m. 

A  second  shortcoming  of  these  bounds  is  that  the  number  of  pro- 
cessors required  is  rather  high  so  that  rather  low  efficiencies  result. 
This  leads  us  to  present  in  Section  k   some  practical  modifications  to  the 
algorithms  of  Sections  2  and  3-   These  modifications  allow  much  higher  effi-  j 
ciencies  without  significantly  reducing  the  speedups . 

The  parallel  evaluation  of  recurrence  relations  has  been  studied 
by  a  number  of  people.  In  [9],  Karp  et.al.  discuss  the  formalization  of 
the  problem  and  prove  some  results  about  the  resulting  graphs .  Muraoka  in 
[10]  discusses  algorithms  for  speeding  up  general  classes  of  recurrence 
relations  and  also  presents  bounds  of  0(log  n)  time  steps  on  a  number  of 


simple  recurrences.   Kogge  and  Stone  [11]  discuss  m-th  order  recurrences 
and  while  they  do  not  give  a  general  processor  bound,  their  time  bound  is 

about  twice  ours  for  this  case.   Heller  has  studied  general  recurrence 

2 
relations  and  proves  that  they  can  be  evaluated  in  0(log«n)  steps  with 

4 
0(n  )  processors  using  a  machine  capable  of  simultaneous  addition  and 

multiplication  (MIMD  operation) .   He  recently  has  improved  the  processor 

3 
bound  to  0(n  )  while  the  time  bound  remains  twice  that  reported  here  [12], 

Although  we  do  not  discuss  any  details  of  machine  organization 

in  this  paper,  it  is  in  order  to  sketch  a  machine  here.   First,  we  list 

some  idealizing  assumptions. 

1)  Instructions  are  always  available  for  execution  as  required 
and  are  never  held  up  by  a  control  unit. 

2)  Each  operation  takes  the  same  amount  of  time,  which  we  refer 
to  as  a  unit  step. 

3)  Any  number  of  processors  and  memories  may  be  used  at  any 
time. 

4)  There  are  never  any  accessing  conflicts  in  the  memory. 

5)  No  time  is  required  to  communicate  data  between  the  proces- 
sors and  memories. 

In  a  properly  designed  system,  for  steady  state  data  and  control 
flow,  standard  pipelining  techniques  may  be  used  to  approximate  1  and  2. 
We  will  present  theoretical  and  practical  bounds  on  the  number  of  proces- 
sors required,  so  3  can  be  removed.   It  will  be  noted  that  the  algorithms 
of  this  paper  mainly  involve  various  kinds  of  vector  and  matrix  products. 
References  [2]  and  [4]  show  various  schemes  for  accessing  rows,  columns, 


diagonals,  square  blocks,  etc.,  without  conflicts  using  a  single  memory 
storage  scheme.   Such  a  memory  should  provide  a  good  approximation  to  4. 
A  pipelined  alignment  network  can  be  used  to  approximate  5  and  the  details 
of  one  possibility  are  given  in  [13] •   Finally,  we  note  that  for  the 
algorithms  we  present,  simultaneous  multiplication  and  addition  (MIMD 
operation)  would  allow  us  to  further  reduce  our  processor  bounds,  but  as 
our  algorithms  are  presented  all  processors  execute  the  same  operation  at 
the  same  time  (SIMD  operation) .   Thus  we  feel  that  our  results  could  be 
of  direct  use  in  the  design  of  real  machines. 

The  following  definitions  will  hold  throughout  the  paper.   Let  T 

be  the  time,  measured  in  unit  steps,  required  to  perform  some  calculation 


using  i  independent  (parallel  or  pipeline)  processors.   We  define  the 

T 

T 


Tl 
speedup  of  a  p-processor  machine  over  a  uniprocessor  as  S  =  ■=—  ,  and  we 


Tl 
define  its  efficiency  as  E  =     _<  1,  which  may  be  regarded  as  the  quo- 

P   P  P 
tient  of  S   and  the  maximum  possible  p-processor  speedup  p.   For  various 

results  in  Sections  2  and  3  we  will  bound  p,  T  and  S  .   In  Section  4  we 

P      P 

will  discuss  these  further  and  introduce  E  values  as  well. 

P 


2 .2  General  Linear  Recurrences 

In  this  section  we  discuss  bounds  on  the  time  and  processors  for 
evaluating  general  linear  recurrences.   The  main  result  is  Theorem  1.   We 
also  give  Algorithm  1  which  may  be  used  for  the  parallel  evaluation  of  a 
general  linear  recurrence  system. 

Some  general  notations  will  be  followed  throughout  this  paper 
unless  otherwise  specified.   We  use  upper  case  letters  to  denote  square 


i 


matrices,  lower  case  letters  to  denote  scalars,  and  lower  case  letters 

with  overbars  to  denote  column  vectors.   Given  any  n  by  n  matrix  A 

=  [a..]    ,  A,  denotes  the  sub-matrix  consisting  of  all  elements  a.,  for 
ij  n*n   Tc  ij 

k<  i  _<  n,  k<j<^n,  and  a,  denotes  a  vector  of  the  last  n-k  elements 
of  the  k-th  column  of  A,  i.e., 

ak  =  (ak+l,k,\+2,k'"*,an,k)   " 

Definition  1         A  general  linear  recurrence  system  of  n  equations, 
R<n>,  is  defined  as 

R<n>:   x.  =  0  for  i  <  0, 

1-1 

=  c.  +  Z       a.  .  x.        for  1  <  i  <  n  . 

i    .  1   ij  3  ~       ~ 

In  matrix  notation  we  write  this  as 

x  =  c  +  Ax 
where 

x  =  \x-i  >  xp>  •  *  •  j  xn'  >         c  =  \cn  y  C2'  ' '  ' y  °n   ' 

and 

A  =  Ta.  .1     with  a.  .  =  0  for  i  <  j. 

'  ij  nxn        i j  - 

For  simplicity,  we  will  assume  n  is  a  power  of  2  for  the  remaining  dis- 
cussion.  If  it  is  not,  the  system  can  be  augmented  and  the  results 

applied  directly.   We  refer  to  x  as  the  solution  vector  of  R<n> .   If  the 

—  — *     * 

solution  vector  x  of  R<n>  is  equal  to  the  solution  vector  x  of  R  <n>, 

* 

we  say  that  R<n>  =  R  <n> . 


i 


10 


Definition  2  Given  any  general  linear  recurrence  system 

R<n> :   x  =  c  +  Ax  , 
we  define  a  set  of  n-1  systems  {R.<n-j>|l  <_  j  <_  n  -  1} 

where 

R.<n-i>:   y.  =  a.  +  A.y. 
J         J    J    J  J 

in  which  y.  is  the  solution  vector  of  R.<n-j>.   The  set  of  all  solution 
J  J 

vectors  y.,  1  £  j  _5_n  -  1,  can  be  written  in  matrix  form  as  a  strictly 
lower  triangular  matrix  Y  =  [y. .     in  which 

y.  .  =  0  for  i  <  j  , 

ij  - 

i-1 
=  a.  .  +   E   a.,y,  .     for  1  <_  j  <  i  <_  n  . 
1J   k=j+l  lk  k3 

As  an  example  of  these  definitions,  consider  the  following 
linear  recurrence  system  with  n  =  4. 


Example  1 

Given  R<4> :   x,  =  c. 


x2  =  c2  +  a21  X;L 


X3   C3  +  a31  Xl  +  a32  X2 


XA   c4  +  a41  xl  +  a42  x2  +  a43  X3 


we  have 


11 


A  = 


21 


31    32 


0 

0 

0 

0 

a7A 

0 

%1      %2       %3 


0 
0 
0 
0 


L 


x   -  [X--.,     Xp,  x  ,     x<  )       , 


-         VCn  >      Cp>      co>   Ch  ' 


In  this  case  there  are  three  corresponding  systems  R.<n-j>  for  1  <  j  <  3, 
written: 


Rx<3>: 


y21  =  a21 


y3l  =  a3l  +  a32  y2l 


y^l  =  \l  +  %2  y2l  +  %3  y3l  ' 


R2<2>: 


y32  "  a32 


and 


Thus  we  have 


y^2  ~  %2   +   aJ+3  y32> 


R5<1>:      ^1+3  =  %$    ' 


Y  = 


0 

0 

0 

0 

y21 

0 

0 

0 

y31 

y32 

0 

0 

Jkl 

Yk2 

Jk5 

0 

where  y.  is  the  solution  vector  of  the  corresponding  system  R.<n-j>  for 

J  J 


1  <  3  <  3. 
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[.emma  1  Given  any 

R<n>  :   x  =  c   +  Ax  , 
there  is  a  corresponding 

R*^i>:   x  =  c  +  Yc  -  (I+Y)c  , 
such  that  R<n>  =  R  <n>- 

Proof:    Our  proof  is  by  induction  on  n.   The  cases  n  =  1  and  n  =  2  are 
trivial  so  we  will  examine  n  =  3  as  our  basis.   If  n  =  3,  we  have  from 
Definition  1 

R  <3>:  xx  =  c 

X2  =  C2  +  a21  Xl  =  °2  +  a21  Cl 

x3  =  c^  +  a31  x±   +  a^  x2  =  c3  +  a^  c±   +  a^  (c2  +  a^  c^ 

The  corresponding  system  has  the  form 
R*<3>:   x  =  c 

x2  =  c2  +  y21  C]_ 

x3  =  c3  +  y3l  °i  +  y32  °2  ' 
Now,  by  Definition  2  we  know  that 

y21  =  a21  (1) 

y3l  =  a31  +  a32y21  =  a3l  +  a32a2l  (2) 

and      y   =  a   .  (3) 

By  substituting  Equations  (1),  (2),  and  (3)  into  R  <3>,  we  get 

R*<3>:  xx  =  c± 

X2  =  C2  +  a21Cl 

x3  =  c3  +  (a3l+a32a2l)cl  +  a32c2  =  c3  +  a3lCl  +  a32(c2+a2lc].)- 
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By  comparing  this  system  with  the  final  righthand  side  obtained  above 
for  R<3>  ,  we  see  that  R<3>  =  R*  <3>  ■ 

Now  as  our  induction  hypothesis,  we  assume  the  lemma  is  true  for 
n  =  k,  and  prove  it  for  n  =  k  +  1.   That  is,  given  any  R<k>,  we  assume  there 

is  an  R  <k>  =■  R<k>  where  R  <k>  can  be  written  as : 


\ 


y21  i 
y3i  y32  x 


•     ♦ 


•     • 


ykiyk2  •'  yk,k-l  \ 


(k) 


Any  system  R<k+1>  has  one  additional  equation  whose  form  by  Definition  1 
is 

*k+l  =  Ck+1  +  V-l.l  Xl  +  ^+1,2  X2  +  '"   +  ak+l,k  *k 


:  Ck+1  +  (ak+l,l'  ak+l,2'  •*•'  ak+l,k} 


\X,  >   X»  ,   •  ■  •  j   ^T»' 


(5) 


If  we  substitute  the  righthand  side  of  Equation  (4)  into  (5),  we  get 
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xk+l  "  °k+l 


+  (Vl,l'  Vl,2 


)     *  •  •  > 


Vl,k) 


r21 


31   J32 


y^o     i 


•     • 


•        • 


y 


y, i 


kl   ,yk2  *  '  -%k-l 


cl] 


°2; 


v.      ) 


By  the  definition  of  yk+1   for  1  <_   j  £  k  (see  Definition  2)  ,  this  can  be 

written  as 


*k+l  =  Ck+1  +  (yk+l,l'  yk+l,2^'-^yk+l,k) 


^ 


(6) 


Now 


by  combining  Equations  (4)  and  (6) ,  we  can  write 


x  = 


xl 

1 

r—               — > 
Cl 

X2 

y21     i 

C2 

x3 

y31     y32 

1                                  , 

c3 

• 

•                          • 

•                                                     1 

m 

. 

= 

• 

•                        •                                            ' 

• 

- 

• 

• 

•                            • 

m 

\ 

ykl      *    *    * 

yk,k-l         1         | 
1- 

°k 

L\+l 

_yk+l,l    '    • 

'     yk+l,k-l  yk+l,kl    1 

_Ck+l 

(I+Y)'c 


Thus,  the  lemma  holds  for  n  =  k  +  1,  and  therefore  for  any  integer  n. 


Q.E.D. 


Before  giving  the  next  lemma,  further  definitions  are  needed. 
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Definition  ^  Let  f  be  a  function  of  a.  ,  where  1  <  i  <  n  and  0  <  j  <  n-1 

and  f  be  a  function  of  a.  ,  where  p+1  <  i  <  p+n  and  p  <  j  <  p+n-1.   Let  f ' 

be  derived  from  f  by  replacing  each  a. .  in  f.  by  a.      .  We  sav  that  f 
-l  ij     l     l+p,  j+p       *  2 

is  a  p-shifted  function  of  f^  if  f»  =  f   and  we  express  this  symbolically  by 
writing  f,   *  f  . 

Definition  ^     For  any  R  <  n  >,  and  any  ordered  pair  (i,  j),  0  <  j  <  n-1, 
1  <  i  <  n,  we  define  a  sequence  of  functions  f,(i,  j),  f„(i,  j),  f.(i,j ),..., 

f  (i, j)  such  that 

j+2r-l 
f _  (i,j)  =  f  (1,3)  +   2    f  (i,k).f  (k,a)   for  r=2m,  0<  m  <  t-1, 
dr        r       k=j+r 
with  the  boundary  conditions 

f^i,  j)  =  a     for  i  >  J, 


in  which 


f  (i,  j)  -  0     for  i  <  j, 


a._  =  c.   for  1  <  i  <  n. 
iO    i        —   — 


For  any  given  f  (i, j),  if  we  apply  repeatedly  the  recurrence 

definition  to  express  f  (i, j)  in  terms  of  f  ,  f  ,...,   and  finally  of  f  , 

2   IT 

we  can  express  f  (i, j)  as  a  function  of  a. .'s  only.   In  the  following,  we 
will  assume  this  final  expression  for  any  f  (i,j)  unless  otherwise  specified. 

Some  properties  of  the  f  function  can  be  observed  directly  from 
the  above  definition: 

Property   i)   fgr(i,j)  =  fr(i,j)   for  i-j  <  r. 

P 
Property  ii)   fr(i,j)  ■•  fr(i+p,j+p)  • 
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Property  iii)   If  g  is  derived  by  replacing  each  a   in  f  (i,,j) 

lK     r 

by  a±+p  k,  then  g  -  fr(i+p,j),  for  i-,j  >  r. 
To  illustrate  Property  ii)  and  Property  iii),  let  n  =  8,  i  =  k,    j  -  0,  r  -   2, 
p  =  1.   Definition  k,    Property  ii)  and  Property  iii),  respectively,  give  us: 
f2(M)  =  ^(M)  +  f1(4,l)-f1(l,0)  =  a^Q  +  a4la1()  , 

f2(5,l)  =  ^(5,1)  +  f1(5,2)'f1(2,l)  =  a51  +  a52a21  , 
and       fg(5,0)  =  fx(5,0)  +  f1(5,D'f1(l,0)  =  a5Q  +  a51a10  . 

Wow  we  will  prove  a  lemma  which  forms  the  basis  of  the  remaining 
theorems . 


Lemma  2   Given  any  R  <n>,  we  have 

i-1 

f  (i,0)  =  a.  _  +  Z  a.  .x.  =  x. 
nv  '  xO    .  ,   ij  J    i 


for  1  <  i  <  n  . 


Proof; 

Our  proof  is  by  induction  on  n.  As  a  basis,  the  lemma  is  obvious 
for  n  =  2.  As  an  induction  hypothesis,  we  assume  it  holds  for  n  =  s.   Thus, 
by  hypothesis  we  have 


i-1 

f  (i,0)  =  a.~  +  Z  a.  .x.  =  x.    for  1  <  i  <  s. 
sv  '      lO     1  •  X3  J    i        ~   " 

Now  we  prove  the  lemma  for  n  =  2s  and  consider  two  cases. 

Case  1    1  <  i  <  s. 

We  know  from  Property  i)   and  Equation  7 

f0    (1,0)   =   f   (i,0)   =  x.        for  1  <  i  <  s. 
2s      '  s     '  i  —       — 


(7) 
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Case  2    s+1  <  i  <  2s. 


From  Equation  7  we  have 

s-1 
f  (s,0)  =  a  -  +  Z  a  .x.  =  x   for  1  <  i  <  s. 

SV  '  '     BO    ,  ,   B]  3     S  - 

By  applying  Property  iii)  to  the  above  f  (s,0)  we  can  get 

s 

s-1 

f  (s+k,  0)  =  a  ^  _  +  Z  a  .  .x.  for  1  <  k  <  s. 
sv   '      s+k,0       s+k,  j  j  - 

J 


(8) 


Having  these  values,  the  equations  for  x , ,  s  <  i  <  2s,  of  R<2s>  can  be  ex- 
pressed in  matrix  form  as  Equation  9« 


X 

s 

/ 

Xs+1 

Xs+2 

= 

• 

• 

• 

wX2s  . 

>. 

ffs(B,0)      ■ 

f   (s+1,0 ) 


f  (s+2,0) 

s  '     ' 


fg(2s,0) 


s+1,  s        0 


!>    .«  a     ~        ,        0 

s+2, s  s+2, s+1 


fca2s,s  a2s,s+l   *    *    *      a2s,2s-l     ° 


By  Lemma  1,    we  know  that  Equation  9  is  equivalent  to 


X 

s 

Xs+1 

• 

Xs+2 

• 

X2s 

L      J 

(9) 


^s+1 


Ls+2 


L2s 


s+1,  s 


y 


s+2,s 


y 


2s,  s 


s+2,  s+1         1 


y 


y. 


2s, s+1   .    .    .      ,y2s,2s-l         1 


fs(s,0)  " 
f   (s+1,0) 


fg(s+2,0) 


fa(2s,0) 


(10) 


where  by  Definition  2  we  have 


i-1 


y.  .  =  a.  .  +       Z         a.,    y,  . 

U  13       k=J+1       ikykj 


for  s  <  j   <  i  <  2s. 


(11) 
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Now  consider  the  equations  for  x  ,  1  <_  i  <_  s ,  of  R<2s>.   If  we 


relabel  all  the  x.  as  x.  n,  we  obtain  a  system  of  the  form 

1  1  y   ^ 


i-1 


xi,o  =  ai,o  +  ^  aik"k, 


k=l 


0  ' 


1  <  i  <  s. 


(12) 


From  Equation  11  and  Equation  12  it  follows  by  Definition  3  that 


Xi-j,0  "*  yi,j  ' 
and  from  Equation  7  we  have 


s  <  j  <  i  <  2s, 


(13) 


f  (i-j,0)  =  x.  .  =  x.  .  n, 


1  <  i-j  <   s. 


(i*0 


Now  by  Property  ii)  we  have 

f8 (l-3,o)  I    ffl(l,3)  > 

so  by  Equation  Ik 


(15) 


Thus  we   see   from  Equations   13  and  15  that  y.  .   and  f   (i,  j)   are  both  j -shifted 

13      s 

functions  of  x.  .  _.  and  it  follows  that 
1-3,0* 


yld  =  ffl(i,d), 


s  <  j  <  i  <  2s.  (16) 

By  substituting  the  right  hand  sides  of  Equation  16  into  Equation  10  we  get 


s+1 


s+2 


^2s 


"1  • 


fs(s+l,s)     1 


f  (s+2,s)     f  (s+2, s+1) 

S  S 


f  (2s, s)      f  (2s, s+1) 
s  s 


f  (2s,2s-l)    1 
s 


fs(s,0) 

fs(S+i,o: 


f  (s+2,0 

s 


f  (2s,0) 

s 


(17) 
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and  by  Definition  k, 

=  (fs(s,0),  f2g(s+l,0),  f2s(s+2,0),...,f2s(2s,0))t.         (18) 

Hence  we  have 

f2s(i,0)  =  x.,   S+l  <  i  <  2s. 

Cases  1,2  together  show  that  the  lemma  holds  for  n  =  2s  and  hence 

Q.E.D. 
for  any  n. 

Now,  we  can  express  our  parallel  algorithm  for  evaluating  R<n>  in 
terms  of  iterative  computations  of  tJL±,0)   =  x±  for  1  <  i  <  n.  It  follows  from 

Definition  k   that  each  f  (i,0),  1  <  i  <  n,  can  be  expressed  as  a  function  of 

f  ,  f  ,  ....  f,  recursively.  Hence,  we  can  evaluate  them  in  the  reverse 

n  '   n  '        '      1  ' 

2    k  \ 

order  simultaneously  for  all  i.   By  Property  i)  of  the  function  f,  the  evalua- 
tion of  f  (i,0)  =  x.  will  terminate  on  the  (log0i)-th  iteration  and  hence  it 
n  '      i  2 

takes  at  most  logun  iterations  to  complete  the  evaluation  of  the  whole  R<h> 
system.   This  is  illustrated  in  the  example  of  Figure  1. 

To  carry  out  this  evaluation  scheme  for  any  given  R<n>  system, 
we  present  in  the  following  a  computational  algorithm,  in  which  identical 

function  values  (such  as  f  (2,0),  f  (3,0)  in  Figure  l)  at  the  same  level  of 

computation  will  be  evaluated  only  once. 

Algorithm  1 

a.  Let  B  be  a  lower  triangular  matrix  of  order  (n  x  n)  in  which 

the  j-th  column  contains  a.  .  n  for  j  <  i  <  n   where  a.   =  c. . 
u  i,j-l      ""   "~  io    i 

b.  Let  C  be  an  alias  for  B,  i.e.,  B  and  C  represent  the  same  memory 
locations . 

c.  Repeat  this  step  for  i  =  1, 2,  . . ., log  n; 

i:  Set  k  =  21; 

ii:   Partition  B  and  C  as  shown  in  Figure  2; 

ili:   Compute  S.  =  S.  +  T.  *  Q.   for  i  <  j  <  n/k  simultaneously. 
J     j     3  3        _ 
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d.   The  first  column  of  B  contains  the  solutions  x   for 

1  <_  i  <_  n. 

The  effectiveness  of  this  algorithm  for  evaluating  R<n>  can 
be  easily  demonstrated  by  the  fact  that  at  the  beginning  of  the  i-th 
iteration  any  element  b    in  any  of  partitions  S.,  T.,  or  Q.,  for 

p,q  j   j    j 

1  <_  j  <_   n/k,  represents  the  value  of  f,  ,.(p,q-l);  and  at  the  end  of 

that  iteration,  any  element  in  S.  represents  f,  (p,q-l).   The  example 

shown  in  Figure  1  relates  the  functional  notation  f  to  the  matrix 
notation  of  Algorithm. 

Note  that  at  the  end  of  the  i-th  iteration  of  Algorithm  1,  the 
first  2   elements  of  the  first  column  of  B  contain  the  solutions 

fx.  Il  <  j  <  2  }  .   We  will  illustrate  this  as  well  as  the  algorithm  in  the 

J 

following  example. 

Example _2        Given  R<8>  as  follows. 

x1  =  2  j 

x2  =  1  +  x1 

x^  =  1  +  2x1  +  x2 

\  =  k   +  xx  +  3xp  +  Sx^ 

x  =  2  +  3x1  +  3x,p  +  2x  +  3x^ 

x6  =  3  +  xl  +  x2  +  x5  +  \  +  x5- 

Xy  =  1  +  2x..  +  2x_  +  x,  +  x,  +  kr  +  2xA 

I  J-       d  J)      4       9       O 

x8  =  3  +  x1   +  x2  4-  2x3  +  2x^  +  x5  +  x6  +  x? 
Tne  solution  set  is  [2,    3,  8,  31,  126,  173,  930,  1285)  which  is 
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Figure  3.   Parallel  Evaluation  of  R<8> 
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the  same  as  shown  in  Figure  3.   In  the  figure  'x1  denotes  a  number  which 
is  no  longer  needed,  matrices  B  and  C  have  been  overlayed  as  a  single 
matrix,  and    ©  ,    ©   indicate,  respectively,  matrix  multiplicatiorf 
and  addition  specified  by  Algorithm  1. 

From  the  above,  we  can  obtain  a  theorem  for  general  linear 
recurrence  problems.  In  what  follows,  we  will  write  log_x  to  denote 
flog?xT  for  notational  simplicity. 

Theorem  1  Any  R<n>  can  be  computed  in 

12    3 
T  <_  ■£  log2n  +  -  log2n  , 

n3 
with  P  £  ~  +   0(n  )  . 

2 

The  minimum  speedup  is  0( =— )  . 

log2n 


Proof:         By  Lemma  2,  x.  =  f  (i,0),  1  <  i  <  n.   But  from 
J  l    n         —   — 

Definition  4,  each  f  (i,0)  can  be  expressed  in  terms  of  f  ,  f  ,  ...,  f , • 

n  n   ii        1 

2   4 

Each  level  of  the  expansion,  e.g.,  from  f,  to  f,   increases  the  computa- 

2 

tion  time  by  one  multiplication  and  one  summation  of  ("T  +  1)  operands; 

k 
the  summation  can  be  done  in  at  most  log_(-^-  +  1)  steps  which  is  less  than 

or  equal  to  log_k  steps.   Thus,  for  a  total  of  log„n  levels,  we  have 

log2n       log2n     12    5 
T  <    Z  .1  +    E   1  =  -  log2n  +  g-  logon. 

y  j=l         ,1=1      d  d 
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Since  the  serial  computation  time  is  n(n-l),  the  speedup  S  is 


n(n-l) 


S  >   1,  2    3n 

P  -   g  g?n  +  2 log2n 


0( 


n  2 
log?n 


For  the  processor  count,  we  refer  to  Algorithm  1  and  Figure  2. 
The  maximum  number  of  processors  are  required  at  the  multiplication  time 
for  each  iteration  i.   Hence,  we  will  count  the  total  number  of  multiplica- 
tions for  T.  *  Q.,  1  <  j  <  n/k,  as  the  required  number  of  processors  p(?  ) 

J    J 

for  a  particular  iteration  i.   It  can  be  easily  observed  from  Figure  2   that 


p(k=2i)  = 


[k/2 

E  J  + 


k (n-k) 


♦* 


k/2 
1)   I  j  +  f  (k  +  2k  + 


.  +  (n-2k) 


for  1  <_  i  <_  log2n  . 


For  large  n,  p(k)  takes  its  maximum  value  at  k  =  n/A  and  gives 

p(n/4)  =  (r^   n  +  I6n  +  8n)/128;  hence  the  maximum  number  of  processors  is 
8 


n3/68  +  0(n2) 


Q.E.D. 


We  close  this  section  with  several  observations  about  Theorem  1. 
If  multiplication  time  is  significantly  greater  than  addition  time  on  some 
computer  it  is  worthwhile  to  note  that  only  log„n  of  our  unit  time  steps  are 
multiplications,  the  rest  being  additions.   We  also  note  that  the  processor 
bounds  can  be  further  improved  by  delaying  the  evaluation  of  some  smaller 
trees,  e.g.,  those  for  x.  and  x0  in  Figure  1.   However,  as  the  algorithm 
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stands,  all  processors  execute  the  same  operation  at  the  same  time  (SIMD 
organization)  and  by  delaying  some  trees  we  would  require  additions  and 

multiplications  at  the  same  time  (MIMD  organization) .   As  our  algorithm 

3 
stands  we  need  about  n  /68  processors  (in  an  SIMD  organization)  and  for 

8  2 

n  <^  2   this  could  be  regarded  as  0(n  )  processors.   Also  note  that  in  a  very 

straightforward  way,  we  can  apply  the  algorithm  and  results  to  the  solution 

of  any  triangular  linear  system  of  equations  Ax  =  b .   This  is  done  by 

transforming  the  system  Ax  =  b  to  the  system  x  =  c  +  Bx  in  which 

c .  =  b ./a. .         for  1  <  i  <  n 
l    i   n  —   — 

and    b..  =  -a.. /a..      for  1  <  j  <  i  <  n  , 
lj     lj   li  -  J     - 

=  0  for  i  <  j  . 

2. 3  M-th  Order  Linear  Recurrences 

In  this  section  we  discuss  bounds  on  the  time  and  processors  for 
evaluating  m-th  order  linear  recurrences.   The  main  result  is  Theorem  2. 
We  also  give  Algorithm  2  which  may  be  used  for  the  parallel  evaluation  of 
an  m-th  order  linear  recurrence  system. 

Definition  6         An  m-th  order  linear  recurrence  system  of  n  equations, 
R<n,m>,  is  defined  as 

R<n,m>:   x.  =  0  for  i  <_  0  , 

i-1 

=  c.  +   Z   a. .  x.     for  1  <  i  <  n  , 
1    •  •    !J   3  —   — 

j=i-m 

where  1  <  m  <  n.   In  matrix  notation  we  write  the  above  as  x  =  c  +  Ax  where 
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A  is  a  strictly  lower  triangular  band  matrix,  i.e.,  a   =  0  for  i  <  j  or 

i  -  j  >  m.   For  simplicity  we  will  assume  n  and  m  are  powers  of  2.   If  n  is 
not,  we  can  augment  the  system,  and  if  m  is  not,  we  choose  the  smallest 
power  of  two  greater  than  m  and  apply  our  results  directly.   Since  we  are 
now  dealing  with  a  band  matrix,  we  are  able  to  prove  the  following  lemma. 

Lemma  3  For  any  R<n,m>,  m  <_  s  <_  n/2,  we  have 

i-1 

f2s(i'0)  =  fs(i>0)  +   Z       fs (i,k)'f,(k,0)   "for  s  <  i  <  s+m,    (19) 

k-s 

=  §  +  £    "  f  (i,k)'f  (k,0)        for  s+ra  <  i  <  2s,   (20) 
k=s 

where  g  is  derived  by  replacing  each  a.   ^  .  in  f  (i,s+m-l)  bv  a 

k.,  s+m-1     s  '  kO 

Proof:         Since  we  are  only  concerned  about  i  j^  2s,  it  is  sufficient  to 
consider  the  first  2s  equations,  i.e.,  x.  for  1  £  i  <_  2s  as  an  R<2s,m>  system 

which  is  a  special  case  of  an  R<2s>  system. 

Now,  the  equations  for  x.,  s  <  i  <  2s,  can  be  recalled  from  Equa- 
tion 17  (derived  from  Equation  9  and  Equation  10),  and  the  equivalent  relation 
stated  in  Equation  18.   However,  by  Equation  8  and  the  property  that  a. .  =  0 
for  i  -  j  >  m,  we  have 

f  (s+k, 0)  =  a  for  m  <  k  <  s  .  (21) 

S  S  '&*  vJ  mmt         ™" 

By  substituting  Equation  21  into  Equation  17,  and  decomposing  the  x  vector 
and  its  corresponding  f  vector  into  an  m-vector  and  an  (s-m+l) -vector,  it  is 
obvious  that  the  first  m-vector  will  confirm  Equation  19 •   Furthermore,  the 
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second  (s-m+1) -vector  shows  that 


f_  (s+m,0) 
zs 

f   (s+m+1, 0] 


f2s(2s,0) 


s+m 


x 


s+m+1 


2s 


f  (s+m,s) 
s 

f  (s+m+1, s) 
s 


f  (2s, s) 


f  (s+m,s+m-l) 
f  (s+m+1, s+m-1) 


f  (2s, s+m-1) 


fs(s,0) 
fg(s+l,0) 


f  ( s+m-1, 0) 


+  b 


where  b  = 


f  (s+nH-l,s+m)    1 

f  (s+m+2,s+m)    f  (s+m+2, s+m+1)    1 


f  (2s, s+m)      f  (2s, s+m+1)   . 


s+m,0 
s+m+1 , C 


2s, 0 


(22) 


By  Equation  16,  we  can  rewrite  b  as 


b  =  Z 


s+m,  0 


s+m+1,  0 


s+m+2,  0 


2s,  0 


where  Z  = 


y  s+m+1,  s+m 


y 


s  +m+2 ,  s  +m  y  s  +m+2 ,  s  +m+l 


y, 


2s,  s+m 


y' 

2s,  s+m+1 


(23) 


29 


However,  from  Equation  16  and  Lemma  1  we  have 


f  (s+xn,  s+m-1 


f  (s+m+1,  s+m-1 ) 

s      ' 


f  (2s, s+m-1) 
s 


y 


s+m,  s+m-1 


y 


s+m+l,  s+m-1 


y, 


2s,  s+m-1 


=  Z  . 


a 


s+m,  s+m-1 


s+m+1,  s+m-1 


2s,  s+m-1 


(24) 


Note  that  each  a.     -  for  s+m  <  k  <  2s  in  Equation  24  has  a 
k, s+m-1        —   — 

one-to-one  corresponding  a,  _  in  Equation  23,  and  that  from  Definition  2 

there  are  no  a.  .  for  0  <^  j  <_  s+m  in  any  element  of  Z.   Hence,  Equation  22 

through  Equation  24  together  prove  Equation  20  in  the  lemma. 

Q.E.D. 

By  using  Property  ii)  and  Property  iii),  we  can  easily  extend  the 
lemma  as  follows. 


Lemma  k 


For  any  R<n,m>,  m  <  s  <  n/2,  we  have 


i-1 


f_  (i,  j)  =  f  (i,  j)  +   Z  f  (i,k)*f  (k,  j)   for   s  <  i-j  <  s+m  , 


"2s 


k=j+s 

j+s+m-1 
+   Z      f  ,(i,k)-f  (k,j) 
k-j+s 


for  s+m  <  i-j  <  n  , 


where  g  is  derived  by  replacing  each  a. 


k,j+s+m-linfs(i^+S+m-l)byakr 


The  parallel  algorithm  for  evaluating  R<n,m>  is  essentially  the  same 

as  that  for  general  case,  i.e.,  for  each  f  (i,  0)  =  x. ,  1  <  i  <  n,  we  express  it 

in  terms  of  f  ,  f  ,  . . .  f,  recursively,  and  then  evaluate  them  in  the  reverse 
n/  n'     1  "" 

2  k 
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order  simultaneously  for  all  i.   The  main  difference  between  this  case  and  the 
general  case  is  that  the  recurrence  definition  stated  in  Lemma  4  will  be 


used  instead  of  Definition  4  for  the  expansion  of  f 0  ,   m  <  s  <  n/2.  As  a  result, 

the  computation  time  introduced  at  each  level  of  expansion  from  f  to  f  is  a 

n     m 

constant  instead  of  a  variable  increasing  with  the  levels  of  expansion. 

Similarly,  to  carry  out  this  evaluation  scheme  for  any  given  R<h,  m> 
system  we  can  establish  the  following  computational  algorithm. 

Algorithm  2 

a.   Let  B  be  a  lower  triangular  matrix  of  order  (nxn)  in  which  the 


j-th  column  contains  a.  .  ,  for  j  <  i  <  n,  where  a.   =  c.  and  a. .  =  0  for 

1,3-1  °   -   -  10    i      ij 

i-j  >  m  initially. 

b.   Let  C  be  an  alias  for  B  i.e.,  B  and  C  represent  the  same  memory 


locations. 


Repeat  this  step  for  i  =  1  ,  2,  .  .  . ,  logpn  ; 
i:   Set  k  =  21  ; 
ii:   Partition  B  (diagonally  shaded  areas)  and  C  (vertically 
shaded  areas)  as  shown  in  Figure  4(a)  (1  <^  i  <^  log„m)  and 

Figure  4(b)  (log92m  <_   i  <_  log_n) .   Note  that  matrices  B 
and  C  are  overlayed  as  a  single  matrix  in  the  figure; 

iii:   If  1  <  i  <  log  m, 

then  compute  S  =S .+T.*Q.   for  1  <  j  <  n/k  simultaneously, 

u     J     J     J 

else  compute  S.=S.+T.*Q.   for  1  <  j  <  n/k  , 

J      J      J      j  ~~       ™~ 

and  V.=V.+T  *U         f or  2  <  j  <  n/k  simultaneously 
J      J      J      J 
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d.   The  first  column  of  B  contains  the  solutions  x  for  1  <_  1  <_  n  . 

At  the  end  of  the  i-th  iteration  of  Algorithm  2,  the  first  2 

elements  of  the  first  column  of  B  contains  the  solutions  {x.|l  <_  j  <_  2    }. 

In  the  algorithm,  the  V  ,  2  £  j  <_  n/k,  contain  all  the  required  values  of 

function  g  as  specified  in  Lemma  4  which  will  be  used  during  the  next 
iteration.   We  illustrate  this  in  the  following  example. 


Example  3  Given  R<16,2>  as  follows: 

x±=  1  x9  =  3+x^Xq 

x2  =  2+Xl  x^  =  l+x8+x9 

x3  =  2+Xl+x2  xi:L  =  l+2x9+3x10 

xk   =  l+3x2+x3  ^  =  2+2x10+xi;L 

x5  =  2+3^+2^  x13  =  2+x11+2x12 

x6  =  ^+3%+2x5  xiU  =  1+X12+X13 

^  =  l+x5+x6  x15  =  5+2x13+Xl4 

xg  =  2+2x6+x?  xl6  =  ^+xlif+x15   .     ^  | 

The  solution  set  is  {1,3, 6, 16, 40,100, 141,3^3,  830, 1174,  5183,7533,20251,27785, 
68292,96081}   which  is  the  same  as  shown  in  Figure  5« 

Now,  we  can  state  a  general  theorem  about  this  class  of  problems. 

Theorem  2  Any  R<n,m>,  can  be  computed  in 

1    2 
T  <  (log2m  +  2)log2n  -  —  (log2m  +  log2m)  . 

3m2n 
with       p  <_  —7 —  +  0(mn)  . 

The  minimum  speedup  is  0(- = )  . 

log2m  log2n 
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Proof;  By  Lemma  4  any  fn(i,0)  ■  x.,  1  <_   i  <_  n,  can  be  expressed  in 

terms  of  f^,  f^,  ...,  f   .   Each  level  of  the  expansion  increases  the  computa- 
2       4 

tion  time  by  one  multiplication  and  one  summation  of  (m+1)  operands.   For  a 

total  of  log„(— )  levels,  we  have 
°2  m 

tx  <  (log2Cmfl)  +  1)  logg(£>  <  (log2m  +  2)  logg(£)  . 

To  express  f^  in  terms  of  f^,  f^,  ...,  f   we  use  Definition  A  for  ex- 

2   4 

pansion  as  in  the  general  case.  Hence,  as  stated  in  the  proof  of  Theorem  1 

we  have 

12? 

t2  5  2  1°S2in  4   tl0g2m   * 

Thus,    the  total  time  T     <  t     4    t   ,    i.e., 

Xr  -*-       C 

Tp  <  (loe2m  *  2)log2n  -  |(log2m  +  log^)  . 
Since  the  serial  computation  time  is  2mn-m(m+l),  the  speedup  S  is 

Q  v,  2mn-m(m+l)  ,         mn v 

P  "  /•,       o\-i       I/-,   2    _     v    log^m  log^n' 
(log2m  +  2)log2n  -  2(log2m  +  log2m)      &2    &2 


For  the  processor  count,  we  refer  to  Algorithm  2  and  Figure  h.     We 
will  count  the  total  number  of  multiplications  for  any  iteration  i  as  the 
required  number  of  processors  for  that  particular  iteration.   This  can  be 
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described  in  2  cases: 

(i)   2  <  21  <  m  (Figure  k&)f 


vl  k/2 


,(k=2i,  <  [x^fi.x)!]  £ 


J  + 


1  +(2f  -D| 


k(m-l) 
2 


k  k  .    "^"l  . 
0=1 


|g  (k+2)(n-k+2)+^(m-l)(n-m-k+2)+2in(m-k)   for  2  <  k  <  m, 


and   (ii)  2m  <  21  <  n  (Figure  Vb), 


p(k=2i)  =  \l   +  (g  -2)(m+l)J  ["  Z     j  +  m(|  -1)1 


(m+1) 


[m 
3=1 


j  +  m(-  -m) 


] 


for  2m  <  k  <  -  , 


and 


p(k) 


m  p 

v   •  ,    /n  \   mn   m  -m 

=  2  j  +  m-  (-  -m)  =  —  -  — — 

3=1 


l2 


for  k  =  n 


Since  p(k)  takes  the  maximum  value  at  k=2m  for  n>>m,  the  maxi- 
mum number  of  processors  p  is 

2                3     2  2 

3m  n  +  2mn  -  n   10m  -  2m  -  4m   3m  n  ,  n/      . 
P  1 4 1 =  -J—  +  0(mn)  . 

Q.E.D. 

As  was  true  with  Theorem  1,  Theorem  2  requires  only  log_n 
multiplication  steps.   Similarly  uniform  operations  are  performed  across 


all  processors  with  this  algorithm.   By  delaying  some  small  trees,  better 


37 


processor  bounds  could  be  achieved  at  the  expense  of  requiring  simul- 
taneous multiplications  and  additions.   The  algorithm  and  results  can  also 
be  used  in  the  solution  of  any  linear  system  Ax=b  where  A  is  a  triangular 
band  matrix  by  a  simple  transformation  as  stated  in  section  1. 

2.U  Practical  Implications 

Now  we  turn  to  a  discussion  of  some  practical  implications  of  our 
algorithms.  As  noted  earlier,  to  obtain  the  largest  theoretical  speedup 
achievable  by  Algorithms  1  and  2,  an  unreasonably  large  number  of  processors 
is  required.   However,  by  the  following  methods  substantial  speedups  over 
serial  machines  can  be  achieved  at  reasonable  efficiencies  with  greatly  re- 
duced numbers  of  processors. 

First  we  shall  discuss  two  methods  of  reducing  the  number  of  pro- 
cessors required.   Then  we  present  numerical  results  which  give  detailed 
estimates  of  possible  speedups  and  efficiencies  using  various  numbers  of 
processors.   Both  general  and  m-th  order  recurrences  are  discussed  and  results 
are  presented  for  ranges  of  m  and  n. 

Present  multi-operation  machines  perform  up  to  64  simultaneous 
bit  parallel  operations  (ILLIAC  IV)  or  up  to  several  hundred  simultaneous 
bit  serial  operations  (STARAN) .   Systems  under  discussion  allow  for  10,000 
or  more  bit  serial  processors  [14].   As  higher  levels  of  circuit  integra- 
tion become  practical,  it  may  be  possible  to  use  as  many  processors  in 
future  machines  as  the  number  of  integrated  circuit  packages  in  current 
machines.   Thus,  machines  with  several  hundred  thousand  processors  may 
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some  day  be  reasonable.   The  numbers  of  processors  presented  below  are 
for  this  reason  allowed  a  wide  range. 

In  modifications  of  both  Algorithms  1  and  2,  we  use  two  methods 
to  reduce  the  number  of  processors  required.   The  first  method,  which  we 
call  cutting  is  to  cut  an  nxn  A  matrix  into  [n/kl  vertical  strips  of 
width  k  and  process  just  one  strip  at  a  time.   We  carry  out  the  following 
for  k  =  r,  2r,  4r,  ...,  n,  with  r=2  for  general  recurrences  and  r=m  for 
m-th  order  recurrences. 

For  a  given  k  value,  consider  the  leftmost  k  column  strip  of 
the  A  matrix.   The  strip  may  be  regarded  as  a  triangular  kxk  system  T  (at 
the  top  of  the  strip)  and  a  rectangular  (n-k) xk  system  R  (at  the  bottom  of 
the  strip).   Furthermore,  each  of  the  [n/kl  strips  has  the  same  form,  with 
fewer  and  fewer  rows  in  successive  rectangular  systems.   Assume  that  p_ 

processors  are  required  to  solve  T  (the  value  of  p   can  be  determined  from 

Theorem  1).   Using  the  solution  of  T,  we  can  solve  R  by  simply  substituting 
these  values  into  each  row  and  computing  an  inner  product.   Assume  that 
this  required  p  processors  (actually  p_  =  k(n-k)).   Let  p  =  max(p  ,p  ) 

K  R  IK 

processors  be  chosen  to  compute  the  first  strip  as  well  as  the  remaining 
[n/kl  -  1  strips.   Clearly,  p  will  suffice  for  each  of  the  other  strips  if 
they  are  evaluated  in  two  parts  as  was  the  first  strip.   Using  this  p  value, 
we  obtain  overall  speedup  and  efficiency  figures  and  the  procedure  is 
repeated  for  all  values  of  k  =  r,  2r,  4r,  ...,  n. 

Note  that  we  could  achieve  better  speedup  results  by  increasing 
the  strip  width  to  occupy  all  p  processors  as  we  sweep  to  the  right.   This 
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was  avoided  to  simply  our  experiments  and  its  avoidance  would  simplify 
the  compiler  implementation  of  such  a  scheme  for  a  real  machine. 

The  above  procedure  generates  a  number  of  possible  speedups  for 
various  reduced  numbers  of  processors.   Next,  we  attack  the  problem  of 
increasing  the  efficiency.   Assume  we  have  a  tree  of  height  t  which  contains 

2  -1  operation  nodes  and  whose  evaluation  requires  2    processors.   The 

2fc-l   -  2 
efficiency  of  such  a  tree  evaluation  is  — — = =  —  .   By  halving  the 

2t   -t   fc 
number  of  processors,  only  one  extra  processing  step  is  required,  so  we  can 

2t-l   -   4 
achieve  an  efficiency  of  — —z =  — —  .   For  large  t,  this  approximately 

2t   (t+1)    t+1 
doubles  the  efficiency  with  a  negligible  decrease  in  speedup.   We  call  this 
process  a  fold  of  the  tree. 

After  i  folds  are  performed  the  tree  height  is  t+2   -i-2, 

while  the  number  of  processors  is  reduced  to  2   /2  .   Thus,  for  large  i 
the  product  of  (processors)  *  (parallel  time),  which  is  the  denominator  of 
the  efficiency  calculation,  changes  very  slowly  and  the  technique  is  of 
little  value.   In  practice  at  most  4  or  5  folds  proved  useful. 

In  the  implementation  of  our  algorithms  it  is  of  course  not  the 
case  that  a  single  tree  is  being  folded.   But  at  each  stage  of  the  calcula- 
tion a  number  of  trees  must  be  evaluated  in  parallel.   In  spite  of  the 
fact  that  the  trees  are  of  different  height,  our  procedure  folds  them  all 
uniformly.   This  would  simplify  a  compilation  algorithm  and  also  allows  all 
processors  to  perform  the  same  operations  at  the  same  time.   By  more 
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elaborate  scheduling  procedures,  greater  efficiencies  could  be  achieved. 

The  results  of  our  cutting  and  folding  experiments  are  sum- 
marized in  Tables  1  and  2.   In  both  cases  we  first  generated  a  number  of 
possible  results  by  cutting  and  folding.   Then,  by  using  the  processor 
count  intervals  shown  as  columns  in  the  tables,  we  tabulated  the  maximum 
speedup  result  obtained  for  each  row.   In  fact,  this  is  the  maximum  of  a 
set  of  lower  bounds  obtained  for  each  row,  column,  position,  from  the 
minimum  speedup . 

In  Tables  1  and  2,  the  rightmost  columns  are  labelled  "maximum 
possible."  These  refer  to  the  upper  bounds  on  the  number  of  processors 
given  by  Theorems  1  and  2,  respectively.   In  cases  where  this  column  is 
empty,  the  rightmost  value  tabulated  is  the  maximum  possible.   Gaps  in  both 
tables  for  efficiency  values  up  to  1.0  were  not  run  for  technical  reasons, 
but  could  be  filled  in  a  continuous  way. 

Table  1  may  be  interpreted  as  follows.   If  one  had  a  machine 
consisting  of  p  processors,  by  scanning  down  the  column  labelled  p,  minimum 
achievable  speedup  and  minimum  achievable  efficiency  can  be  read  for 
various  problem  sizes  n.   Alternately,  if  one  wants  to  design  a  machine  for 
problems  of  size  n,  a  row  scan  will  indicate  the  maximum  number  of  proces- 
sors required  by  our  method  to  achieve  the  speedup  tabulated. 

It  should  be  noted  that  the  speedups  shown  along  the  diagonal 
with  p=n-l,  which  are  trivial  to  achieve  on  a  parallel  machine,  can  be 
substantially  improved  by  our  methods.   By  reading  across  rows  to  the 
right,  we  see  that  for  small  n,  the  speedup  can  be  improved  by  a  factor 
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of  2  or  3  while  still  maintaining  efficiencies  of  about  10%  or  better. 
For  large  values  of  n  the  speedup  can  be  improved  by  about  a  factor  of  10 
over  the  speedup  of  the  simple  n-1  processor  case,  with  order  of  10% 
efficiencies.   Of  course,  if  a  large  number  of  processors  exist  they  can 
be  used  on  smaller  problems  with  lower  efficiencies  to  obtain  higher 
speedup  improvements  than  mentioned  above. 

Table  2  may  be  interpreted  as  was  Table  1,  although  several 
points  should  be  clarified.   The  values  shown  in  Table  2  were  obtained 
for  the  case  n=16,384.   However,  they  hold  approximately  for  other  n 
values  as  well.   This  may  be  explained  as  follows. 

If  we  approximate  the  parallel  computation  time  of  Theorem  2  as 
T  =  log^m  log_n,  for  a  cutting  scheme  which  uses  strips  of  width  k  the 

speedup  is  approximately 

T 
c  _  __±  Z  mn  mk 


P   T    n  log„m  log„k 

P   -  log2m  log2k     &2    &2 

~   2 
Correspondingly,  we  have  p  =  m  k,  which  leads  to 


S  =  * 2~  .  (25) 

P   m  log2m  log2(p/m  ) 


It  may  be  seen  that  Equation  25,  with  p  held  constant,  has  a  minimum  S 

value  for  some  ra  in  the  range  of  interest  to  us.   Thus,  for  p  >  256,  the 

columns  of  Table  2  exhibit  a  decrease  in  S   and  then  an  increase  as  m 

P 

varies  from  2  to  128. 

Equation  25  also  shows  that  speedup,  by  this  approximation,  is 
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independent  of  n.   Thus,  we  can  use  Table  2  as  an  approximation  for  any  n 

value.   Of  course,  we  should  not  use  arbitrarily  many  processors  on  any 

2 
problem  and  by  Theorem  2  the  limit  can  be  approximated  by  p  <  m  n.   It 

follows  that  entries  in  Table  2  are  valid  approximately  for  n  >  *s-  .   Our 

m 

experiments  were  exhaustive  for  m  _<  128  and  n  <_  16,384  and  our  data 

substantiates  the  above.   The  approximations  are  introduced  here  simply 

to  avoid  presenting  many  similar  numerical  tables. 

Now  let  us  consider  the  speedup  improvement  which  can  be 

obtained  by  our  method  over  the  simple  parallel  implementation  using  p=m 

shown  on  the  diagonal.   For  small  m  and  efficiencies  of  about  10%  we  can 

obtain  improvements  of  3  or  4.   For  larger  m  the  improvement  does  not 

increase  much,  unlike  the  general  recurrence  case.   However,  if  a  rather 

large  number  of  processors  is  available  (say  to  solve  large  general 

recurrences)  then  very  large  speedup  improvements  (at  efficiencies  lower 

than  10%)  can  be  realized  for  small  m.   For  example,  if  16,384  processors 

were  available,  we  could  achieve  a  speedup  improvement  over  the  diagonal 

values  of  about  225  for  m=2  (S  =451.93)  and  about  83  for  m=4  (S  =332.62). 

P  P 

We  conclude  that  for  m-th  order  recurrences  with  small  m,  which  are  in 
practice  encountered  quite  frequently,  the  methods  discussed  here  should 
be  useful  in  exploiting  large  parallel  machines. 
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3.   TIME  AND  PARALLEL  PROCESSOR  BOUNDS  FOR  LINEAR 
RECURRENCE  SYSTEMS  WITH  CONSTANT  COEFFICIENTS 

3.1   Introduction 

Linear  recurrences  with  constant  coefficients  arise  frequently 
in  general  numerical  computations.   Several  analytic  methods  for  the 
solution  of  these  equations  are  available,  such  as  by  solving  for  the 
roots  of  its  characteristic  polynomial  or  by  the  use  of  generating 
functions. 

We  are  interested  in  a  parallel  and  direct  computational  algo- 
rithm which  can  be  used  to  evaluate  them  at  a  high  speed.   Such  syst 
may  be  represented  as  x  =  c  +  Ax"  where  c  is  a  constant  column  vector, 
is  an  nxn  strictly  lower  triangular  matrix  with  bandwidth  of  m,  m  <  n, 
and  all  elements  are  identical  along  each  sub-diagonal.   We  show  that 
0(log2m  log2n)  time  steps  and  0(mn)  processors  are  sufficient  to  solve 
such  a  system.   We  also  show  that  general  recurrences,  i.e.,  with  m  =  n, 
can  be  computed  within  Odog'n)  time  steps  with  at  most  n2/4  processors. 
While  such  systems  remain  in  the  same  order  of  speed  as  in  the  algorithms 
for  arbitrary  coefficients  discussed  in  Chapter  2,  the  bounds  on 
processors  are  sharpened  by  a  factor  of  m  and  n  for  the  m-th  order  and 
general  constant  coefficient  system,  respectively.   To  achieve  these 
results,  all  processors  need  only  perform  one  type  of  operation  at  each 
time  step  (SIMD  operation). 

We  further  show  that  our  method  can  be  modified  to  evaluate  the 
remote  terms  in  a  homogeneous  linear  recurrence  sequence  with  at  most 
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2  3 

0(m  )  processors  in  contrast  to  the  0(m  )  processors  required  by  the 

Miller-Brown  algorithm.   Also,  the  parallel  evaluation  of  n-th  degree 

polynomials  can  be  completed  within  21og9(n+l)  time  steps  with  at  most 

^r-  +  1  processors  which  compares  favorably  with  recent  results  [15],  [l6l 

for  practical  applications. 


3.2  General  Linear  Recurrences  with  Constant  Coefficients 

In  this  section  we  discuss  bounds  on  the  time  and  processors 
for  the  direct  evaluation  of  the  following  class  of  general  linear 
recurrences 


R<n>:   x.  =  0 

1 


for  i  <  0, 


i-1 


=  c.  +  £  a.  x.  .    for  1  <  i  <  n  . 
In  matrix  notation,  we  write  this  as  x  =  c  +  Ax  where  A  = 


strictly  lower  triangular  matrix.   For  example, 


[a.  .  ]    is  a 
nxn 


R<4>: 


r"      "" 

xi 

X2 

= 

*3 

x4 

. 

;ll 


L  *J 


0   0   0   0 


a1  0   0   0 


a2  a±     0   0 


a~   a„   a,   0 


r         "1 

xi 

X2 

• 

X3 

x4 

_ 

For  simplicity,  we  will  assume  n  is  a  power  of  2.  The  main 
result  is  Theorem  3.  We  also  give  Algorithm  3  which  may  be  used  as  a 
basic  algorithm  for  the  parallel  evaluation  of  this  class  of  problems. 
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First,  let  us  state  one  important  lemma  which  follows  directly 
from  Definition  2. 


Lemma  5 


Given  any 


R<n> :   x  =  c  +  Ax  , 


its  associated  matrix  Y  as  defined  in  Definition  2  has  identical  elements 
along  each  sub-diagonal.  For  example,  let 


A  = 


0 

0 

0 

0 

1 

0 

0 

0 

2 

1 

0 

0 

3 

2 

1 

0 

then  Y  = 


0 

0 

0 

0 

1 

0 

0 

0 

3 

1 

0 

0 

8 

3 

1 

0 

It  is  obvious  that  the  j-th  column  of  Y  can  be  obtained  by 
simply  shifting  the  first  column  (j-1)  places  downward.   We  will  denote 
this  operation  as  y .  =  y, [ j-1]  where  y.  is  the  j-th  column  of  matrix  Y, 

These  notations  will  be  used  throughout  this  chapter. 

Now,  we  present  the  proposed  parallel  algorithm  and  the  proof 
of  its  effectiveness  will  immediately  follow. 

Algorithm  3 

a)   Let  B  be  a  lower  triangular  matrix  of  order  (n*n) 


1       1  '  ^"9  '  *  *  *  '  ^"n     ' 


b2  =  (0,  o^,  ct2,  . . .,  «n_1)   , 


and 


b  =  b2[j-2],  3  <  j  £n  . 
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b)  Let  C  be  an  alias  for  B,  i.e.,  B  and  C  represent  the  same 
memory  locations. 

c)  Repeat  this  step  for  i  =  1,  2,  .  ..,  log~n; 

i:   Set  k  =  2  ; 

ii:   Partition  B  and  C  as  shown  in  Figure  6; 

iii:   Compute  S.  -  S.  +  T.  *  Q.     for  1  <  i  <  min(2,?-) 
j    j    j    j  ~   -      k 

simultaneously; 


a)       (2) 


n 


iv:      Set  y  =  y±      [j-1]  for   1   <  j    <  k,    2   <_  ft   <_  £  ; 


v:      Compute  D.   =   D.   +  W.    *   Z. 
3  3  3  3 

simultaneously. 


n 


for  1  j<  j  j<  min(2,-rr-) 


d)   The  first  column  of  B  contains  the  solutions  x.  for 


1  <  i  <  n. 


claim: 


To  justify  this  algorithm,  we  need  only  to  prove  the  following 


Lemma  6 


At  the  end  of  the  i-th  iteration  of  Algorithm  3,  we  have 

-(1)  t 

1)  y1   ■  (x15  x2,  . .. ,  x^)   ; 

2)  for  each  partition 

y£k+l,£k 

yJlk+2,£k       y£k+2,Jlk+l 
(£+1)  .  for 

1   <    I  <  7-  - 
—       —  k 


yik+k,Ak       yj2k+k,£k+l 


J2k+k,  flc+k-1 


where  y   are  elements  of  the  associated  matrix  Y  of  A;  and 
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3)   for  k  <_  —   ,  D  is  the  vector  of  the  partial  sums  of  the 
first  k  terms  in  the  expression  of  x,  .  ,  x,  _,  ...,  x~,  ,  and  similarly 
D2  is  that  of  y3k+lj2k,  y3k+2>2kJ  •••>  y4k>2k  for  k  <f  . 


Proof: 


We  prove  this  claim  by  induction  on  k.   It  is  true  for 


k  =  2,  as  a  basis.   Let  us  assume  it  is  also  true  for  all  iterations  up 

to  —  for  some  k  as  shown  in  Figure  6.   Then,  during  the  current  iteration, 

by  hypothesis  Condition  (1)  the  first  element  of  Q_  is  x   and  S..  is  the 

2 
vector  of  the  partial  sums  of  the  first  -r     terms  in  the  expression  of 

x^   ,  x^   ,  ...,  x   by  hypothesis  Condition  (3). 
y»-l   f-2  k 


Hence,  we  can  write 


■"■ 

~ 

*k 

2 

*k 

2 

Xk 

I 

i 

= 

Sl 

i 
+   1 

ff2 

j 

• 
• 
• 

\ 

rLy2 


T    '2 


k,- 


ak   k    ° 
^■2   ^f  1 


k.frt 


Xk 
2 

1 

—  — 

+ 

xk 
|fl 

\ 

r2 

•            1 

m 
• 

•        1 

• 

k,k-l 

0 

Xk 

— 

_          _ 
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By  virtue  of  Lemma  1  and  hypothesis  Condition  (2) ,  the  bottom  partition 
of  the  above  equation  is  equivalent  to  (S  +  T   *  Q  ) .   This  proves 


Condition  (1).   Similarly,  it  can  be  shown  that  the  matrix  operation 
S  =  S_  +  T  *  Q   specified  in  step  (iii)  generates  the  results  of 

(y3^  J  yi^0    '  ■—  y2k,k>  ■  which  combines  with  (yk+i,k'  yk+2,k'  •••' 

"TiCtJ.  ,  K     _  K.+Z  ,  K. 

y_   )   from  the  previous  iteration  (hypothesis  Condition  (2))  in  forming 
|k,k 


(2) 


(2) 


the  first  column  y    of  partition  Y     of  the  current  iteration. 

—(2) 
With  the  new  result  of  y    ,  now,  we  can  see  that  the  shifting 

operation  in  step  (iv)  will  result  in  Condition  (2)  due  to  Lemma  5. 

After  step  (iii)  and  step  (iv) ,  we  know  that 
Z  =  (x  ,  x  ,  ...,  3c  _i )   •   Since  in  all  previous  iterations,  no  operation 

has  been  done  on  the  data  below  the  partitions  on  the  diagonal, 

Y(l)   for  1  <_  i  ^  |  ,  we  have 


Dl    (ck+l'  Ck+2'  •'•'  ck)   ' 


;ak+l,l   ak+l,2 


w1-| 


a2k,l    a2k,2 


•     • 


^l.k-l 

^+2,1   ak+2,2   *  ^+2^-1 


2k,k-l 


Hence,  the  resultant  vector  D  of  step  (v)  is  a  vector  of  the  partial 

sums  as  stated  in  Condition  (3).   Similarly,  it  is  true  for  D_.   This 
completes  our  proof. 
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ITERATION 


ITERATION    2 


ITERATION    3 


2\ 

3    \ 

6     x 

x\ 

15    x 

X 

x\ 

30   x 

X 

X 

x\ 

64   x 

X 

X 

x  Vy 

130  x 

X 

X 

x     x    >K 

274  x 

X 

X 

J 

X      X       X  \^ 

Figure  7.   Parallel  Evaluation  of  R<8> 
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It  is  worthwhile  to  note  that  the  shifting  operation  stated 
in  step  (iv)  could  be  easily  accomplished  by  a  proper  indexing  function, 
thereby  no  major  computation  time  is  involved  and  more  efficient  use  of 
storage  is  possible  by  such  a  scheme. 

To  illustrate  this  algorithm,  we  use  a  R<8>  as  an  example  shown 
in  Figure  7. 

With  this  algorithm,  we  can  how  establish  a  general  theorem 
about  this  class  of  problems . 

Theorem  3        Any  R<n>  can  be  computed  in 

2 
T  <  log0n  +  21og0n  -  1  . 
p  —    z         £. 

with        p  _<  n  /4  . 

2 

The  minimum  speedup  is  0( r— )  . 

log2n 

Proof:         During  the  i-th  iteration  of  Algorithm  3,  if  a  sufficient 
number  of  processors  are  given,  then  the  matrix  operation  in  step  (iii) 
takes  at  most  one  multiplication  time  plus  the  time  for  the  summation  of 

at  most  Cr+1)   operands,  which  is  log_k  addition  steps.   So  that  for  a 

total  of  log  n  iterations,  this  step  alone  takes 

log?n 

12    3 
t1  5.     l       J  +  log2n  =  2   lo82n  +  2  log2n  * 

Similarly,  step  (v)  takes  at  most  one  multiplication  time  plus 
the  time  for  the  summation  of  k  operands  which  is  log-k  addition  steps. 
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,IK   . 


Thus,  for  a  total  of  log_(y)  iterations,  this  step  needs 


log2  ty  n 

t   <  Z  3   +   log9(y) 


in2    !  i        i 
2  log2n  +  2  log2n  "  X 


Therefore,  the  total  computation  time  T  <  t.  +  t.  ,  i.e., 

2 
T   <_  lo82n  +  21o82n  "  X    ' 

Since  the  serial  computation  time  is  n(n-l),  the  speedup  S   is 


s  >   2  ^ =  o(-3— -)  . 

P   log2n  +  21og2n  -  1       log2n 


To  achieve  the  above  speed,  a  sufficient  number  of  processors 
should  be  provided  at  the  multiplication  time  for  each  iteration  i.   So 
we  will  choose  the  maximum  of  the  number  of  multiplications  in  step  (iii) 
and  step  (v)  as  the  number  of  processors  required  for  that  particular 
iteration.   For  step  (iii),  we  can  see  from  Figure  6(a)  that 

k/2 
p  (k=2X)  <2Ej      for  2  <k<^  , 

J-l 

k/2 
<_  Z  j      for  k  =  n  . 
j=l 

For  step  (v) ,  it  can  be  observed  from  Figure  6(b)  that 


P2(k)  <  2k(k-l) 


for  2  <  k  <  7 
—   —  4 


<  k(k-l) 


*   i    n 
for  k  =  — 
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2  2 

For  large  n,  p1(n)  £  (n  +  2n)/8  and  p2(n/2)  £  (n  -  2n)/4 

give  the  respective  peak  values.   Hence,  the  maximum  number  of  processors 


is  n2/4  . 


Q.E.D, 


As  the  basic  algorithm  stands,  only  21og_n-l  unit  time  steps 

are  multiplications,  the  rest  being  additions.   Also  note  that  in  the 
above  processor  count,  all  processors  perform  uniform  operations  at  each 
time  step,  and  hence  the  bound  can  be  further  improved  by  lifting  this 
restriction. 

We  close  this  section  with  some  observations  about  the  computa- 
tional efficiency  of  this  algorithm.   Although  its  speed  is  approximately 
one-half  that  of  Algorithm  1  in  Chapter  2,  the  number  of  processors  is 
reduced  by  a  factor  of  n.   Thus,  it  gives  us  a  tremendous  increase  in 
computational  efficiency.   On  the  other  hand,  since  R<n>  is  a  special  case 
of  arbitrary  coefficient  system,  we  can  use  the  latter  algorithm  to  compute 
R<n>  with  some  operations  masked  off.   Due  to  Lemma  5,  now  the  matrix 

operations  S.  =  S.  +  T.  *  Q.  in  step  (iii)  of  Algorithm  1  are  performed 
J    J    J    J 

only  for  j  =  1,  2.   All  others  are  masked  and  supplemented  by  shifting 
operations  of  S_.   Further  masking  is  possible  even  in  computing  the 

portion  of  S„  in  the  top  triangular  part  (Figure  2)  with  Q?  and  S_ 

shrinking  to  vectors  instead  of  matrices,  as  in  Algorithm  3  (Figure  6). 
In  this  case,  it  can  be  shown  that  the  maximum  number  of  processors 
required  is  about  n3/128. 
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3.3  M-th  Order  Linear  Recurrences  with  Constant  Coefficients 

In  this  section,  we  will  turn  to  the  considerations  of 
practical  m-th  order  recurrence  problems,  i.e., 

R<n,m>:   x.  =  0  for  i  £  0  , 

m 
=  c.  +  Z  a.  x.  .     for  1  <  i  <  n  , 

i    i=1i   t-j  -    - 

where  1  <_  m  <  n.   In  matrix  operation  x  =  c  +  Ax  with  A  being  a  strictly 
lower  triangular  matrix  of  bandwidth  m.   For  simplicity,  we  will  assume 
both  n  and  m  are  powers  of  2. 

Since  A  is  now  a  band  matrix,  we  can  further  speed  up  the 
computation  of  Algorithm  3  when  i  >.  log92m  .   This  is  done  by  changing 

partitions  S.,  Q.,  T.,  Z.,  D.  and  W.  in  Figure  6  to  that  in  Figure  8, 
3        3        3        3        3  3 

and  introducing  new  partitions  U.,  V.,  G.  and  E.  and  their  corresponding 

3        3        3  3 

matrix  operations  as  described  in  Algorithm  4.   The  proof  of  its  validity 
can  be  found  in  Lemma  3  and  4  of  Chapter  2. 


Algorithm  4 

a)  Let  B  be  a  lower  triangular  matrix  of  order  (nxn)  in  which 

t>2  =  (0>  ai>  a2'  ***'  am'  ^'  ' '  *  »  0)   > 
and       b  =  b2[j-2],  3  <.  j  <  n  . 

b)  Let  C  be  an  alias  for  B,  i.e.,  B  and  C  represent  the  same 
memory  locations. 

c)  Repeat  this  step  for  i  =  1,  2,  ...,  log  n  ; 
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(i)   Set  k  =  2   ; 
(ii)   Partition  B  and  C  as  shown  in  Figure  6 

(1  <_  i  <_  log„m)  and  Figure  8  (log  2m  <_   i  <_  log^n)  . 

(iii)   If  1  <.  i  £.  log»m  , 


n, 


then  compute  S.  =  S.  +  T^  *  Q.   for  1  <  j  <  min(2,— ) 
J    3  J    3  ~       ~  'm' 


simultaneously ; 


ns 


else  compute  S.  =  S.  +  T.  *  Q.   for  1  <  j  <  min(2,— )  , 
3  3  J     j        ~       ~  k 


and  V.  =  V.  +  T  *  U.   for  2  <  j  <  - 
3    J    2    j        -J-k 


simultaneously . 
(£)    (2) 


n 


(iv)   Set  y    =  y.   [j-1]   for  1  <_  j  <_  k,  2  £  l   <_  t-   ; 


(v)   If  1  <_   i  £  log  m  , 


n 


then  compute  D.  =  D  +  W.  *  Z.    for  1  <  j  <  min(2,  ■=— ) 


3  5  3  3 

simultaneously ; 


2m' 


n 


else  compute  D.  =  D.  +  W.  *  Z.    for  1  <  j  <  min(2,-rr-) 
Till         —   —       2k 


and  E.   =  E.   +  W     *  G.        f or  2  <  j   <  Tjr 


3  3 

simultaneously. 


-  2k 


d)   The  first  column  of  B  contains  the  solutions  x.  for 
1  <_   i  £  n. 

For  illustration,  an  example  of  R<16,2>  is  shown  in  Figure  9, 
In  that  figure,  all  numbers  are  kept  in  place  to  help  understanding, 
although  they  may  not  be  used  after  certain  iterations. 

From  the  above,  we  can  obtain  a  general  theorem  for  the  m-th 
order  linear  recurrence  problems. 
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Figure  9.     Parallel  Evaluation  of  R<16,2> 
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Figure  9    (continued).     Parallel  Evaluation  of  R<16,2> 
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Theorem  A  Any  R<n,m>  can  be  computed  in 


T  <_  2   log-n       for  m  =  1, 


2 
£  Ulog-m  +  3)    log.n  -    (log2m  +  log_m  +  1)      for  m  >   1, 


with  p   <  mn. 


The  minimum  speedup  is  0(- ; )  . 

log2m  log2n 


Proof;         For  1^.1^,  log9m,  step  (iii)  is  similar  to  that  of 
Algorithm  3  and  takes  totally 

log2m  12    3 

t  _<  Z   j  +  log2m  =  -  log2m  +  -  log2m  . 

j=l 
However,  for  log  2m  <^  i  <_  log  n,  this  step  needs  only  one  multiplication 

time  and  one  summation  time  of  (m+1)  operands  per  iteration.   This  amounts 
to 

t2  <  (log2(m+l)  +  1)  log2(^)  <  (log2m  +  2)  log^)  . 

For  step  (v) ,  it  is  similar  to  Algorithm  3  for  1  <  i  <  log-(-r-) 

and  the  total  time  is 


log^-j)         m    1    2    1 
t2  —     l  J  +  log2^  =  2   log2m  +  2   log2m  ~  1  * 

This  same  step  will  take  one  multiplication  time  and  one  summation  time 

of  m  operands  per  iteration  for  log_m  £  i  <_  log(— )  .   Thus,  in  total  it 
has 

t4  <  (log2m  +  1)  log  <*)  . 


Hence, 
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T  <   E   t  =  (21og0m  +  3)  log„n  -  (log_m  +  log„m  +  1)  , 
P  -  j  =  1  j        2  2        2       2 

m  >  1  . 


For  m  =  1,  since  the  W.'s  diminish,  T  <  t,  +  t„  ■  21og_n. 

3  p  —  1    2      b2 

Comparing  this  with  the  serial  computation  time  2mn  -  m(m  +  1) , 

the  speedup  is 


S   > 


2mn  -  m(mfl) 


(21og2m  +  3)  log2n  -  (log2m  +  log2m  +  1) 


=  0( 


mn 


log2m  log2n 


)  . 


In  obtaining  the  above  speed,  we  assume  that  there  are  enough 
processors  at  the  multiplication  time  of  any  iteration.   The  number  of 
multiplications  required  in  step  (iii)  can  be  observed  from  Figure  6(a) 
and  Figure  8(a)  that 


p-lOc 


.     k/2 

2  )  <2  Z  j     f or  2  <  k  <  m  , 

J-l 


i(r+l)[Zj+m(f-  m)] 
k      j=l      l 


m  k 

<^  E  j  +  m(—  -  m)     for  k 


n 


for  2m  <_  k  <_  y  , 


=  n 


For  step  (v) ,  we  can  see  from  Figure  6(b)  and  Figure  8(b)  that 
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m 


p2(k)  <  2k(k-l)     for  2  £  k  <  |  , 

m-1 
<_   2  E  j       for  k  =  m  , 
j=l 

m-1 
<  (—+  1)(  Z  j)     for  2m  <  k  <  |  , 

3=1 

m-1 
_<   E  j        for  k  =  y  . 

When  n>>m,  p..  (■—)   =  (3mn  -  6m  +  6m) /4  is  the  maximum  value  and 

2 
p_(k)  =  0(m  )  for  all  k's.   Therefore,  the  processor  bound  is  mn. 

Q.E.D. 


As  was  true  with  Theorem  3,  Theorem  4  requires  only  21og9n  -  1 

steps.   All  processors  are  performing  the  same  operations  at  the  same 
time.   Without  this  restriction,  a  better  processor  bound  could  be 
achieved.   Also  note  that  as  m  increases,  greater  efficiency  can  be 
achieved  by  using  this  algorithm  instead  of  Algorithm  2  for  arbitrary 
coefficients  as  described  in  Chapter  2.   On  the  contrary,, we  can  achieve 
twice  this  speed  by  applying  the  latter  algorithm  with  some  operations 

masked  off.   In  this  case,  the  processor  bound  can  be  shown  to  be 

3 
0(mn  +  m  )}  and  one  might  prefer  it  as  m  becomes  very  small. 

We  will  conclude  this  section  with  some  comparisons  between 

these  results  and  some  known  algorithms  in  the  evaluations  of  two  distinct 

special  cases.   First,  in  computing  the  remote  terms,  e.g.,  x  ,  in  a 

homogeneous  recurrence  sequence  relation 

x-  +  olx.  -,  +  a.x.  .+  ...  +  a  x,   =0, 
1    ±   l-l    2   i-2         m  i-m 
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Miller  and  Brown  [17]  have  shown  an  algorithm  with  about  the  same  speed 

3 
as  that  of  Algorithm  4  by  using  0(m  )  parallel  processors.   However,  it 

can  be  shown  easily  that  all  vectors  U.,  V,,  G,  and  E.  are  zero  vectors, 

j    J    j      J 

and  hence  those  operations  can  be  masked  off  totally.   In  addition,  since 
we  can  leave  out  all  unnecessary  intermediate  results,  parts  of  the 
matrix  operation  S.  =  S.  +  T.  *  Q.  in  step  (iii)  and  D  =  D  +  W  *  Z 

in  step  (v)  can  be  further  masked  out  and  modified  as  shown  in  the  new 

partitioning  in  Figure  10.   Therefore,  the  maximum  number  of  processors 

2  3    2 

is  3m  .   On  the  other  hand,  given  m  +  m  processors,  we  can  use 

Algorithm  2  in  Chapter  2  with  the  new  partitioning  as  shown  in  Figure  11, 

and  achieve  twice  the  speed  of  the  Miller-Brown  algorithm. 

As  a  result,  we  can  establish  the  following  facts. 


Corollary  4.1         Any  remote  term  x  of  a  homogeneous  linear 
recurrence  relation 

x  +  a  x .  ,  +  a o  x.   +  .. .  +  a.   x  =  0  , 
i    1  l-l    2  x-2         ui-m  m 

2 
can  be  computed  within  (21og  m  +  3)  log9n  time  steps  with  0(m  )  processors, 

3 

and  within  (log  m  +  2)  log_n  steps  with  0(m  )  processors. 

Any  polynomial  P  (a)  of  degree  n  can  be  represented  as  the  last 

solution  of  the  following  linear  recurrence  relation 

R<n+l,l>:  x.  =  c.  +  a  x.  ..     for  1  <  i  _<  n  +  1  . 
x    l      x-i  — 

By  Theorem  4,  this  can  be  computed  within  21og  (n  +  1)  steps.   However, 

the  processor  bound  can  be  reduced  as  we  only  need  the  last  solution. 

This  is  done  by  changing  the  partitions  Q  ,  S.  and  T.  for  j  =  1,2  in  Figure  8 
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to  that  in  Figure  10  (note  that  other  partitions  U.,  V.  in  Figure  8 
still  remain).   Thus,  it  can  be  shown  to  have  the  maximum  demand  of 
processors  for  k  =  2,  i.e.,  p  £  I— —  I  +  1.   For  illustration,  let  us 
evaluate  P?(a)  by  Algorithm  4  (or  Algorithm  2  since  for  m  =  1,  both 

algorithms  become  identical)  but  with  the  modified  partitions  mentioned 

above  on  the  matrix  B  =  [b..]     of  Figure  8.   The  whole  process  can 

1J  8x8 

be  summarized  as  follows. 

First,  we  compute 

2 

a  =  a*a  , 


and   b.  .  =  c.  +  c.  ,  a 
i,l    l    i-1 


for  i  =  2,  4,  6,  8  simultaneously; 


then  compute 


4    2  2 

a  =  a  *a   , 


and   bi,l  =  bi,l  +  bi-2,l  a 


for  i  =  4,  8  simultaneously; 


and  finally  compute 


b8,l  =  b8,l  +  b4,l  a 


which  is  now 


=  [(cg  +  c7a)  +  (c6  +  c5a)a  ] 

2  4 
+  [  (c ,  +  c_a)  +  (c~  +  c.a)a  ]a 

The  final  result  in  b--  is  obviously  the  value  of  P_(a).   This  procedure 

may  be  generalized  to  any  degree  n  and  the  computational  process  will 
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proceed  like 


2 
P    (a)   =c     +c      .a+    (c      o+c      0a)a 
n  n  n-1  n-2  n-3 

2      4 
+    (c      .   +  c      ca  +    (c      ,   +  c      -,a)a   )a 
n-4  n-5  n-6  n-7 

+    (cn_8  +  cn_9a  +    (cn_1Q  +  cn_na)a2  +    (cn_12 
+  Cn-13a  +   (cn-14  +  cn-15o)o,2)oA)°8 

"T*      •   •  •  • 

It  is  interesting  to  note  that  the  above  procedure  is  exactly 
equivalent  to  the  well-known  Estrin's  method  [l8l«   In  other  words, 
Estrin's  method  becomes  a  very  special  case  of  our  algorithms  for  evalu- 
ation of  general  R<n,m>  systems.   Hence,  we  can  formalize  it  as  a 
corollary. 

Corollary  2.2  Any  n-th  degree  polynomial  can  be  computed  within 

2  log  (n  +  1)  time  steps  using  at  most  I— r—  +  1  processors. 

By  comparing  this  result  with  the  known  k-th  order  Horner's 
rule,  it  is  slightly  faster  and  generally  requires  less  number  of 
processors  to  achieve  the  same  speed  [10],  [19].   Kuck  and  Maruyama  [16] 
show  that  a  general  polynomial  form  of  degree  n  can  be  evaluated  in 
0(log_n  +  /81og_n)  steps.   While  the  speed  is  faster,  it  requires  far 

more  processors  (p  =  2n)  than  that  required  by  our  result.   In  practical 
applications  where  n  is  not  very  large,  this  method,  even  compared  to 
the  fastest  known  multifolding  method  [15],  becomes  attractive  not  only 
because  of  its  simple  implementation  but  also  its  easy  integration  with 
more  general  linear  recurrence  problems. 
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3.4  Practical  Implications 

As  was  in  the  cases  of  Chapter  2,  the  basic  algorithms  dis- 
cussed here  are  primarily  for  conceptual  understanding.   By  more 
elaborate  scheduling  procedures  or  practical  modifications  of  the 
original  ones,  one  can  develop  more  efficient  computational  algorithms 
which  can  use  much  less  processors  without  significantly  reducing  the 
speedups. 

For  illustration,  during  the  iteration  at  k  =t  of 

Algorithm  3,  we  can  perform  the  multiplications  of  each  inner-product 

in  step  (v)  within  two  steps  instead  of  one  step,  thereby  giving  p  (n/2) 

2  2 

=  (n/2)(n/A)  =  n  /8  and  T  <_   log.n  +  21og  n.   This  folding  scheme 

obviously  halves  the  number  of  processors  needed  and  almost  retains  the 
same  speed  as  Theorem  3.   Similar  procedures  can  be  applied  to  different 
iterations  and  extended  to  multiple-folding  to  achieve  more  efficient  use 
of  processors.   Alternately,  we  can  cut  the  triangular  system  into  strips 
and  then  solve  the  recurrence  system  on  the  top  of  each  strip  with  a 
sufficient  number  of  processors.   We  then  compute  the  partial  sums  in 
the  bottom  of  that  strip  for  the  remaining  equations  as  a  new  constant 
vector  for  the  remaining  triangular  system.   This  process  is  repeated 
from  the  leftmost  strip  to  the  rightmost  strip  sequentially.   This 
cutting  scheme  can  also  help  in  increasing  efficiency  as  shown  in 
Chapter  2. 

We  can  conclude  that  a  proper  design  and  modification  of  the 
basic  algorithms  presented  here  for  a  particular  environment  should 
provide  us  high  efficiency  computations  in  exploiting  future  parallel 
machines . 
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4.   SPEEDUPS  OF  PRACTICAL  LINEAR  RECURRENCE  SYSTEMS 
WITH  A  LIMITED  NUMBER  OF  PROCESSORS 


4. 1   Introduction 

Most  linear  recurrence  relations  which  occur  in  ordinary  pro- 
grams are  of  m-th  order  with  m  being  a  very  small  number.   Chapters  2 
and  3  have  given  algorithms  to  evaluate  any  m-th  order  linear  recurrence 
system  in  0(log?n)  time  steps  by  using  0(n)  processors,  where  n  is  the 
number  of  solutions  to  be  computed  and  assuming  n>>m.   However,  the 
practical  multiprocessing  systems  at  present  may  consist  of  only  a 
limited  number  of  processors,  p  <  n.   In  this  case,  using  those  algorithms 
could  result  in  somewhat  inefficient  computations  as  shown  in  the  middle 
of  the  upper  left  portion  of  Table  2.   The  major  reason  is  the  inherent 
nature  of  non-uniform  distributions  of  processors  required  at  each 
computational  stage.   On  the  other  end  of  the  line,  it  is  trivial  to  show 
that  such  systems  can  be  solved  in  0(n)  steps  using  p  =  m  processors,  with 

a  limited  speedup   S  =  m. 

P 

We  provide  in  this  chapter  two  alternative  algorithms  for  the 

evaluation  of  m-th  order  recurrences.   It  is  shown  that  if  n>>p>m,  then 

any  such  system  can  be  speeded  up  at  least  by  ~  — - —  p   for  arbitrary 

2m+3 

coefficients  and  ~  ^~_     p   for  constant  coefficients,  at  an  efficiency 

independent  of  n.   When  m  is  very  small  in  practical  programs,  computations 
by  these  methods  will  generally  be  more  efficient  than  by  those  discussed 
in  Chapters  2  and  3. 


71 


4.2  M-th  Order  Linear  Recurrences  with  Arbitrary  Coefficients 

In  this  section  we  consider  the  evaluation  of  any  given 

R<n,m>:   x  =  c  +  Ax      for  n  >  p  >  m  . 

For  simplicity,  we  assume  n,  p  and  m  are  all  powers  of  2  although  our 
algorithms  can  be  used  for  arbitrary  values. 

We  will  describe  this  algorithm  symbolically  by  an  R<32,4> 
system  as  an  example.   The  principle  can  apply  to  any  R<n,m>  system. 

First,  let  B  be  a  lower  triangular  matrix  of  order  (nxn) 
and 

B  =  ["c,A]  . 

All  given  data  are  shown  as  dots  in  Figure  12(a). 

Next,  let  us  assume  we  have  p  =  km  =  16,  then  we  can  partition 

B  into  k  =  2-   =  4  subsystems  Z    ,  1  <  1  <  k.   These  are  shown  as  solid 

m         J  —  J  — 

dots  in  Figure  12(a),  in  which  each  Z  "*    ,  2  <_  j  <±   k,  consists  of  the 
solid  dots  in  the  diagonal  pre-concatenated  by  those  in  the  first  column 

of  B.   Thus,  the  number  of  columns  in  each  Z    ,  is  r  -  (m-1)  =  5 

except  Z    with  ^  =  8.   We  also  partition  B  into  Y  J  ,  2  £  j  <_  k,  as 

shown  in  Figure  12(b),  each  with  —  columns.   Every  Z  ■*      or  Y  ~*   will  be 

treated  as  an  independent  recurrence  system  with  the  chopped  corner 
assuming  zero  values. 

In  the  first  stage,  we  solve  Z  J  ,  1  <_  j  _<  k,  simultaneously, 
each  with  m  processors.   This  is  done  in  a  very  straightforward  way, 
i.e.,  multiplying  the  i-th  column  by  the  (i-l)-th  element  of  the  first 
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(a)  MATRIX  B 


Figure  12.  Dynamic  Partitioning  for  Algorithm  5 
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Figure  12    (continued) .     Dynamic  Partitioning  for  Algorithm  5 
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Figure  12    (continued) .     Dynamic  Partitioning  for  Algorithm  5 
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column  of  Z    ,  and  adding  the  resultant  vector  to  the  first  column. 
This  proceeds  successively  from  i  =  2  to  the  last  column.   Having  done 
this,  we  will  have  in  Figure  12(c) 


D0  ==  (xl'  X2'  •*•'  Xn/k-l)  ' 


and,  in  particular 


Z,  = 


(Z11'  Z12' 


•"  Zlm) 


where  z..  denotes  the  partial  sum  of  the  first   (m+2-j)  non-zero  terms 
in  the  expression  of  x        ,  1  <_   j  £  m,  respectively. 

<£>i+j-i 

In  the  second  stage,  we  solve  each  Y  J   and  its  associated 
(m-1)  subsystems  (defined  in  Definition  2)  in  parallel,  and  also  simul- 
taneously for  2  <_  j  _<  k. 

As  a  result ,  we  have 


W.  = 
J 


0 


r+l,r 


r+2,r    yr+2,r+l 


r-hn,r 


y         *  *  *  y 
r+m,r+l  yr+m,r+m-l 


y  ,n  -    y  ,n  _   ... 
Jr-h— l,r   r+r—  l,r+l 
k         k 


r-h — l,r+m-l 
k 


for  1  £  j  <_  k  -  1  , 
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and 


Y.  = 
J 


y  r 


yr,|+m-l 


yr+m-l,|  ' 


r+m-l,yfm-l 


for  2  <_   j  £  k, 


,ns 


where  r  =  (— )j  and  y's  are  elements  of  the  associated  matrix  Y  of  A. 


Now,  by  applying  Lemma  3  repeatedly,  we  can  compute  the 
following  first-order  recurrence  system  of  vector  matrix  form 


Z.  =  Z.  +  Y.  Z.  . ,     2  <  i  <  k 
xxi   x-1        —   — 


(26) 


where  new 


x     xl   x2 


•  * '  Z*J) 

xm 


and  z..  is  defined  as  above.   Note  that  Z,  and  Y,  have  been  augmented  by 
xj  k      k 

zeroes  to  become  conformable  and  the  first  element  of  each  resultant  Z., 

x 

i.e.,  z.-,  1  <  i  <  k,  is  the  solution  x,  ,.  s  .    . 
xl'   —   —  (n/k)x 

After  that,  by  the  same  reasoning  of  Lemma  3,  we  may  evaluate 

all  other  solutions  by 


D. 

x 


+  W.Z.     for  1  <  i  <  k  -  1, 
xx  —   — 


simultaneously . 


We  may  summarize  this  procedure  as  follows, 
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Algorithm  5 

a.  Let  B  be  a  lower  triangular  matrix  of  order  (nxn)  in 

which  the  i-th  column  contains  a.  .  .,  for  j  <  i  <  n,  where  a. rt  =  c. 
J  i,j-l     -j  _   —  '        iO    i 

and  a. .  =  0  for  i  -  j  >  m  initially. 

b.  Let  C,  E  be  aliases  for  B,  i.e.,  B,  C  and  E  represent  the 
same  memory  locations 

c.  Set  k  ■»  *■  and  h  =  7-  ; 

m        k 

i:   Partition  B,  C  and  E  as  shown  in  Figure  12(a),  (b) 
and  (c) ,  respectively,  in  which  the  order  of  each 
partition  is 

Z(j):    (h+m-1,  h)      for  j  =  1, 

(h,  h-m+1)      for  2   <_  j  <_  k  -  1  , 
(h-m+1,  h-m+1)   for  j  =  k  , 

y'^':   (h+m-1,  h)      f or  2  <_  j  <   k  -  1  , 
(h,  h)         for  j  =  k  , 


Dy  (h-1,  1)        for  j  =  0  ,  • 

(h-m,  1)        for  1  <  j  <  k  -  1  , 


for  1  <_  j  <   k  -  1  , 

for  j  =  k  , 

for  1  _<  j  <_  k  -  1  , 

for  2  <  j  <_  k  -  1  , 

for  j  =  k  . 


Z.: 
3 

(m,    1) 

(1.    1) 

W.: 
J 

(h,   m) 

Y.: 
J 

(m,   m) 

(I,   m) 
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ii: 


iii: 


Compute  all  recurrence  systems  Z    ,  1  <_  j  <^  k, 
simultaneously; 

Compute  each  recurrence  system  Y  J   and  its 
associated  (m-1)  subsystems  in  parallel,  and  also 
simultaneously  for  2  <  j  <  k; 


1  <  i  <  n. 


iv:   Compute  the  recurrence  system  of  vector-matrix  form 


Z.  =  Z_,   +  Y.  Z.  .  ,    2  <  i  <  k  . 
x    i    i  x-1      —   — 


v:   Compute 


+  W.Z. 
1  x 


for  1  <  i  <  k  -  1 


simultaneously. 


d.   The  first  column  of  B  contains  the  solutions  x.  for 

x 


Note  that  at  the  end  of  step  (ii) ,  solutions  x.,  1  <_  i  _<  h, 


are  in  place;  at  the  end  of  step  (iv)  ,  x,  . ,  1  <_  i  <^  k,  are  also  in  their 

corresponding  positions. 

In  the  following  we  use  a  R<16,1>  as  an  example  to  illustrate 

this  algorithm.   Note  that  matrices  Z.  and  Y.  become  scalars  in  this 

i      J 

case. 


Example  4 


where 


Given  <16,1>:   x  =  c  +  Ax 

c   =  (1,2, 1,1, 1,5, 4, 1,2, 1,2, 4, 3, 1,2,1)*  , 

a .  .  =  0   for  i  -  j  >  1   or   j-i^.0, 
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and  the  first  subdiagonal  of  A  is 


ad  s  (a21'a32,a43'*",a16,15) 

=  (1,1,2,2,4,2,1,1,2,1,1,2,3,4,1) 


If  we  have  p  =  4,  then  we  have  k  ■  4  and 


(1)  _ 


(4)  . 


(2)  . 


1 

"l 

"2 

" 

2      1 
10      1 

z(2>  = 

5 
4 

4 
0     2 

z(3)  = 

1 
2 

2 
0 

1 

10     0 

2 

, 

1 

0     0 

1 

, 

4 

0 

0     1 

- 

3 

1     3 

2     0     4 

10     0 

1 

, 

2 

"  1 

"  2 

- 

0     4 
0     0     2 

Y(3)   . 

0 
0 

2 
0     1 

Y(4)  m 

0 
0 

3 
0 

4 

0     0     0 

1 

, 

0 

0     0 

1 

, 

0 

0 

0     1 

The  solutions  of  Z   ,  1  <  i  <  4,  are 


Do" 


(x1,x2,x3)  =  (1,3,4),  Z1 
(1,9,22),   Z2  =  23  , 


(2,5,7),   Z3  =  11  , 


(3,10,42),  Z4  -  43  . 


=  x,  =  9  , 


80 


and  the  solutions  of  Y    ,  2   <_   i  <_  4,  are 
W*  =  (0,2,8,16),  Y2  =  16  , 


Wj  =  (0,1,2,2),   Y3  =  2   , 


\t*   =  (0,2,6,24),  Y4  =  24  . 


With  these  results,  we  will  solve  the  following  recurrence 


system 


to  obtain 


Z.  =  Z.  +  Y.  Z.  -     for2<i<4 
x    i    i  l-l  —   — 


Z1  =  x4  =  9  , 


Z2  =  xg  =  167  , 


Z3  "  X12  "  345  ■ 


Z.  =  xn.  -  8323  . 
4    16 


Thus,  the  rest  of  the  x's  can  be  computed  as  follows. 


X4 

9 

0 

9 

x5 

1 

+ 

2 

•9  = 

19 

X6 

9 

8 

81 

-X7. 

22 

16 

166 

X8 

167 

0 

167 

X9 

2 

1 

169 

= 

+ 

•167  = 

X10 

5 

2 

339 

J"u 

7 

2 

341 

81 


12 


x 


13 


14 


15 


345 

0 

3 

+ 

2 

10 

6 

42 

24 

_            -m 

_     _ 

345 


345] 
693 

2080 
8322 


Now,  we  have  the  complete  solution  set  {1,3,4,9,19,81,166, 
167,169,339,341,345,693,2080,8322,8323}  for  the  original  system.   With 
this  algorithm,  we  can  establish  a  general  theorem. 


Theorem  5 


If  n>>p>m,  then  any  R<n,m>  can  be  computed  by  using  p 


processors  within  (2m  +3m)—  +  0(m  log_(p/m))  time  steps.   The  minimum 


speedup  is  approximately   ~ _,   p 


Proof: 


In  step  (ii)  ,  each  system  Z   ,  2  j<  j  <_  k,  can  be 


computed  by  m  processors  within 


o  /inn    \    1  /TOR        1  \ 

t   <  2(—  -  m)  <  2(- 1)  . 

1  -   p       -   p 


(j) 


In  step  (iii) ,  one  processor  is  assigned  to  every  Y  J   and 
each  of  its  associated  (m-1)  subsystems  for  2  <^  j  _<  k.   This  requires 


t„  <  (2m  -m) (2m-l)  . 

I  —  p 


By  applying  Algorithm  2  to  step  (iv)  now,  during  each  iteration 

i         k 
i  (let  r=2  )  we  have  —  matrix-vector  products  as  well  as  vector-vector 

k   r 
additions,  plus  (—  -  —)   matrix-matrix  products.   These  amount  to 

km  ,  ,k   r,   2  .        ,      ^  „  .      ,         . 

Tj-  +  (y  -  y)  m  inner-products  of  2  m-vectors  in  total.   By  assigning 
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one  processor  to  each  inner-product  and  assuming  vector-vector  additions 
are  performed  at  the  end  of  each  iteration,  then  it  takes 


k  2 

t3  <  log2  k  +  E  (2m-l)  f(jp  (m+1)  -  ^/p] 
r=2 


<  [(2m-l)£  +  1)  +  1]  log9(£)  . 


For  the  last  step  (v) ,  there  are  n  -  —  -  (k-1)  pending  solutions, 
each  requiring  one  inner-product  and  one  addition.   Hence,  totally  we  need 
t4<  (2m)|-(n-  if**-   1))/P]  • 

The  entire  computation,  therefore,  can  be  done  within 

9     n        9 

T  £  E   t.  -  (2m  +3m)-  +  0(m  log,  (p/m)) 
P   i=1  i  P 

for  n>>p>m  . 
And  since  the  serial  computation  time  is  T-  =  2mn  -  m(m+l) , 

2mn  -  m(m+l) 


P    (2m2+3m)-+  0(m2log9  (p/m)) 
P  *• 


2 

p     for  n>>p>m 


2m+3 

Q.E.D, 


If  only  the  last  solution,  say  x  ,  is  required,  then  step  (v) 
can  be  omitted.   This  gives  us 

Corollary  5.1  If  n>>p>m,  then  any  remote  term  x  of  R<n,m> 

9    n        2 

can  be  computed  by  using  p  processors  within  (2m  +m)—  +  0(m  log-  (p/m)) 


2 
time  steps.   The  minimum  speedup  is  approximately  — - —  p 

2m+l  r 
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For  homogeneous  R<n,m>,  a  slight  modification  of  the  above 
algorithm  should  have  more  efficient  computations  than  that  stated  in 
Theorem  5  and  Corollary  5.1.   However,  we  will  not  elaborate  on  it  here. 

We  close  this  section  with  some  observations  about  the  computa- 
tional efficiency.   A  quick  comparison  could  be  made  at  e.g.,  m  =  1  as 
shown  in  Example  4,  in  which  we  solve  the  following  system 

R<n,l>:   x.=c.+a.x.n  ,     1  <  i  <  n  , 

ill  l-l        —   — 

in  ~  —  time  steps  in  general,  and  within  "  —  time  steps  if  only  the 
P  P 

last  solution  is  needed.   For  the  latter  case,  we  can  compare  it  to  the 
p-th  order  Horner's  rule  for  the  parallel  evaluation  of  n-th  degree 
polynomials  (n+l>>p>2)  which  is  equivalent  to  finding  the  last  solution 
of  the  following  system 

R<n+l,l>:  x.  »  c.  +  a  x_,  ,  ,      1  <  i  <  n  +  1  . 

i    i      i-I         —   — 

It  is  well  known  that  p-th  order  Horner's  rule  will  take  ~  —  time  steps 

P 

while  our  method  only  needs steps  to  evaluate  a  far  more  complex 

y  P 

system. 

For  more  numeric  comparisons  with  the  previous  methods,  we  list 
in  Table  3  some  experimental  results  for  n  =  16,384.   It  is  easy  to  see 
that  for  small  m  this  alternative  algorithm  can  give  far  better  speedups 
and  efficiencies  than  the  ones  in  Table  2,  as  shown  in  the  entries  to  the 
right  of  the  bracket  in  each  row. 

For  example,  given  p  =  256  we  can  obtain  a  speedup  improvement 

70  32 
over  the  previous  method  of  about  5  (=    *   )  for  m  =  2  and  about 

14 .  81 
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3  (=  ->/ >yc.)    f°r  m  =  4.   However,  when  m  becomes  bigger  which  is  not  the 

practical  case  usually,  then  this  method  will  become  impractical.   To 

show  the  differences  for  a  moderate  range  of  n  we  use  double  entries  in 

Table  4  for  n  =  128  and  Table  5  for  n  =  1024.   Each  entry  represents 

S   and  E  .   The  left  entries  are  for  the  previous  algorithm  with  p 
p      p 

omitted,  and  the  right  ones  for  the  new  algorithm.   From  these  data  and 

other  untabulated  results  for  different  ranges  of  n,  we  can  conclude  that: 

Observation  In  the  evaluation  of  m-th  order  linear  recurrences , 

if  the  following  conditions  are  satisfied: 

-i  \   n       .2 
!)  2  -  P  -   Am 

and       2)  m  <_  16  , 

then  the  computation  by  Algorithm  5  will  yield  the  same  or  better  speedup 
(and  hence  efficiency)  than  by  Algorithm  2. 

4.3  M-th  Order  Linear  Recurrences  with  Constant  Coefficients 

In  this  section  we  discuss  an  efficient  algorithm  for  R<n,m> 
when  n>p>m. 

From  Lemma  5  in  Chapter  3,  we  know  that  the  partitions  of 

00    00 
Figure  12  now  will  have  (by  neglecting  the  augmented  parts  of  Z    ,  Y   , 

Vi and  V 


Z(D  „  zo> 


CO   .  y(j)  for  2  ^  ±       k>    2  <  j  £  k   , 
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and  hence  the  resultant  matrices 


Wi=W      for  l£i  <  k-  1,  1  <  j  <  k-  1  , 


it-t 


for  2  £  i  £  k,  2£j£k  . 


Further, 


t.  =  t^Ij-1]     for  2  <  j  <  m  , 


where  t.  is  the  j-th  column  of  matrix  T  as  shown  in  Figure  13,  and 

brackets  [x]  indicate  shifting  operations  as  defined  in  Chapter  3. 

With  these  properties,  we  need  only  to  solve  the  recurrence 

(2)  —      — 

system  Y    (without  any  associated  subsystem)  to  obtain  w.  -  and  y91 • 

After  the  shifting  operation  of  T,  we  can  complete  the  evaluation  of 
Y2  by 


Y2  =  Y2  +  QW;L  . 


Therefore,  we  can  establish  the  following  method  for  R<n,m> 


problems 


Algorithm  6 


a.   Let  B  be  a  lower  triangular  matrix  of  order  (nxn)  in  which 


bl    (C1'  °2'  •*"  °n)   ' 


b?  -  (0,  an,  a0,  ...,  a  ,  0,  ...,  0) 


V  ^2 


m 


and 


b  =  b2  [j-2],  3  <  j  <  n  . 
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Figure  13.      Dynamic  Partitioning  of  Matrix  E  for  Algorithm  6 
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b.  Let  C,E  be  aliases  for  B,  i.e.,  B,  C  and  E  represent  the 
same  memory  locations 

c.  Let  k  =  £ and  h  =  —  ; 

m  k 


Partition  B  and  C  as  shown  in  Figure  12(a)  and  (b) , 
respectively.   Partition  E  as  shown  in  Figure  13, 
where  T  is  of  order  (h+l,m) ,  and  W  ,  Y  ,  Q  are  all 


of  order  (m-l,m-l); 


(j) 


ii:   Compute  all  recurrence  system  Z  J   for  1  <^  j  j<  k 

(2) 
and  Y    ,  simultaneously; 


iii:   Let  t.  =  t-  [j-1]  for  2   <_  j    <.  m,  and  then  compute 


Y2  -  Y2  +  QWl  ; 


iv:   Compute  the  recurrence  system  of  vector-matrix  from 


zi "  zi  +  Vi-i  > 


2  <  i  <  k  ; 


v:   Compute 


D. 


D. 


+  W-Z.     for  1  <  i  <  k  -  1, 
1  i  —   — 


simultaneously . 


d.   The  first  column  of  B  contains  the  solutions  x.  for 

x 


1  <  i  <  n, 


In  analogy  with  Theorem  5,  we  can  derive  a  general  theorem  from 
this  algorithm. 


91 


Theorem  6  If  n>>p>m  then  any  R<n,m>  can  be  computed  by  using 

p  processors  within  "^  _  n   +  0(m  log   (p/m))  time  steps.   The  minimum 
speedup  is  approximately   r*- _  p  . 


Proof:  Each  independent  system  in  step  (ii)  can  be  computed 

by  m  processors  within 

.  .,mn    \    ^   n  z™1   i  n 

t     <   2( m)  <  2(— —  -l;  . 

1  —  xp-m     —   p-ni 

2 
There  are  (m-1)   inner-products  of  2  (m-l)-vectors  in  step  (iii) 

One  processor  can  compute  one  inner-product  in 

2. 


t2  <  2 (m-1) 


(m-1)' 


• 


In  step  (iv) ,  we  can  use  Algorithm  4.   Then,  during  each 

i  k 

iteration  i  (let  r  =  2  ) ,  there  are  —  matrix-vector  products  as  well  as 

r  k 

vector-vector  additions,  plus  -r    matrix-matrix  products  for  2  <_  r  <_  —  . 

Applying  the  same  strategy  as  that  for  step  (iv)  in  Theorem  5,  we  require 

k/2  2 

t3  <  (  I   (2m-l)  f(|S+  a-)/p|)  +  (2m-l)[^|  +  log2  k 

<  [(2m-l)(|+  1)  +  1]  log2  (£)  . 
Similarly  to  that  for  Theorem  5,  step  (v)  takes 

Thus ,  the  algorithm  totally  requires 

4 
T  <  j  t    2mn(2p-m)  +  q^  (  /m))   for  n»p>m  . 

v       i=i      p(.p-mj  l 
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Since  T  =  2mn  -  m(nrl-l)  ,  we  have 


2mn  -  m(m+l) 


p  —  2mn(2p-m)  n/   21     ,    ,    N. 

p(p-m)  +  °(m  log2  (p/m)) 

p-m  , 

% p  for  n>>p>m  . 

2p-m 


Q.E.D. 


By  omitting  the  last  step  (v) ,  we  can  easily  show  that 

Corollary  6.1  If  n>>p>m  then  any  remote  term  x  of  R<n,m>  can 

be  computed  by  using  p  processors  within  — 1-  0(m  log»  (p/m))  time 

steps.   The  minimum  speedup  is  approximately  p-m. 

It  is  easy  to  see  that  this  algorithm  offers  a  very  efficient 
computation  for  linear  recurrences  with  constant  coefficients.   Since,  by 
letting  e.g.,  p  =  2m,  we  have  —  <_  E   <_  —   ,  and  if  only  the  last  solution 

is  required  we  have  —  <_   E   <  1.   In  other  words,  by  using  this  algorithm 

under  practical  situations,  we  may  obtain  speedups  proportional  to  the 

limited  number  of  processors  we  have,  and  the  constant  of  proportionality 
S 

P   P 

possible  speedup,  is  at  least  one-third  or  more.   Also,  it  is  interesting 
to  note  that  for  the  parallel  evaluation  of  n-th  degree  polynomials,  which 
is  a  special  case  of  Corollary  6.1  with  m  =  1,  this  method  has  the  same 


E  =  —*-   ,  which  can  be  regarded  as  the  ratio  of  the  actual  and  maximum 


order  of  speed  as  does  the  p-th  order  Horner's  rule  (n  +  l>>p  _>  2), 

with  efficiency  E  >  -r-  . 
p  -  2 

As  was  true  with  Algorithm  5,  Algorithm  6  could  be  expected 
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to  give  higher  speedups  and  efficiencies  for  the  evaluation  of 
practical  R<n,m>  systems.   Similar  conditions  may  be  observed  as  we 
did  in  the  preceding  section  to  warrant  the  use  of  this  algorithm. 
However,  as  it  is  a  special  case  of  R<n,m>  systems,  the  previous 
observation  can  still  hold  for  R<n,m>  systems. 
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5.   SPEEDUP  OF  LOOP  COMPUTATIONS  IN  ORDINARY  PROGRAMS 

5.1  Introduction 

With  all  the  studies  of  linear  recurrence  systems  behind  us, 
we  now  proceed  to  look  at  the  possible  speedups  of  iterative  programs 
which  are  usually  written  as  a  loop  in  some  high-level  languages  using, 
for  example,  Fortran  DO  or  Algol  FOR  statements.   As  stated  in  Chapter  1, 
there  are  apparently  two  ways  to  increase  speeds  of  these  loop  computa- 
tions on  a  multiprocessing  system.   The  first  one  is  to  devise  new 
parallel  algorithms  and  provide  existing  languages  with  adequate  parallel 
features.   This  approach  will  help  by  providing  explicit  parallelism  in 
the  program  and  reduce  the  time  otherwise  spent  in  iterative  loops.   For 
instance,  the  development  of  parallel  FFT  algorithm  is  a  prominent 
example  [20] •   All  algorithms  in  the  previous  chapters  can  be  used  to 
replace  some  iterative  loops  used  to  solve  triangular  systems  in  many 
numerical  programs.   An  extensive  survey  in  these  areas  can  be  found  in 
[21],  [22].   What  we  are  interested  in  here  is  the  second  approach,  i.e., 
automatically  detecting  the  implicit  parallelism  in  ordinary  program 
loops  at  the  compiler  level. 

Many  people  have  worked  in  this  area.   In  his  automatic  pro- 
gram analysis,  Russell  gives  some  algorithms  to  test  the  possibilities 
of  replication  of  simple  loops  [  5].   The  most  extensive  studies  are 
done  by  Muraoka  [10] •   He  provides  methods  to  analyze  subscript  ex- 
pressions, and  extract  parallelism  in  a  loop  with  single  assignment 
statement  by  the  replication  or  wave-front  method.   The  latter  scheme 
is  also  used  by  Lamport  for  an  ILLIAC  IV  Fortran  compiler  [23].   For 
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loops  with  multiple  assignment  statements,  Muraoka  also  describes 
algorithms  to  exploit  parallelism  by  replication  of  the  whole  loop  or 
by  distribution  of  the  original  loop  into  a  set  of  smaller  loops,  as 
well  as  by  the  wave-front  method  between  assignment  statements.   Lately, 
we  have  generalized  most  of  his  work  and  implemented  them  in  a  Fortran 
program  analyzer  to  measure  the  potential  speedups  in  ordinary  Fortran 
programs  [7  ],  [8  ]•   Cohagan  recently  also  reported  techniques  to 
generate  vector  operations  from  DO  loops  in  the  TI  ASC  Fortran  compiler 
[  24] .   All  of  these  works  have  demonstrated  the  feasibility  of  detecting 
a  fairly  large  amount  of  parallelism  from  ordinary  program  loops. 

However,  for  one  of  the  frequently  encountered  statements  in  a 
loop  like 

X(I)  =  X(I-l)  +  Y  , 

most  of  the  previous  workers  do  not  handle  it  because  of  its  sequential 
nature.   In  fact,  this  is  a  special  case  of  linear  recurrence  relations. 
Muraoka  has  discussed  this  class  of  first-order  recurrences  and  shows 
that  they  can  be  speeded  up  significantly  by  forward  substitution  and 
tree-height  reduction  algorithms.   Nevertheless,  for  a  slightly  different 
statement  such  as 

X(I)  =  X(I-l)  +  X(I-2)  +  Y  , 

the  above  method  will  become  very  difficult  in  implementation  and  may 
result  in  poor  speedup  if  the  number  of  iterations  is  large.   But,  by 
using  the  algorithms  in  Chapter  2  through  Chapter  A,  which  are  simple, 
efficient,  and  require  no  complicated  machine  structure  to  support  them, 
we  feel  that  it  is  practically  feasible  and  desirable  to  extract 
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parallelism  from  even  these  highly  serial  codes  in  ordinary  programs. 

In  section  2,  we  discuss  theoretically  the  speedup  spectrum 
of  different  types  of  loops  in  terms  of  the  problem  size  (sequential 
computation  time  T  ).   As  shown  before,  the  order  of  any  recurrence 
represented  in  the  loop  should  be  the  smaller  the  better  as  far  as  the 
speed  and  processors  are  concerned.   To  facilitate  this,  we  introduce 
informally  in  section  3  the  distribution  algorithm  to  decompose  a  given 
loop  into  loops  of  smaller  sizes.   By  this  algorithm,  we  may  also  be 
able  to  expose  vector  operations  as  much  as  possible.   In  section  4,  we 
discuss  time  bounds  of  some  loop  structures  with  known  computation 
patterns  (dependence  graphs)  and  certain  measurable  parameters,  with 
which  the  minimum  speedup  on  a  multiprocessing  system  of  a  particular 
class  of  loop  computations  (and  hence  the  algorithms  they  represent) 
can  be  estimated.   At  last,  some  practical  implications  for  compiler  and 
machine  design  are  described  in  section  5. 

Before  we  proceed  further,  some  basic  definitions  and  notations 
are  given  below  which  will  be  used  throughout  the  rest  of  this  chapter. 

Definition  7         An  assignment  statement  is  denoted  by 

x  =  E 

where  x  is  a  single  or  array  variable,  and  E  is  a  well-formed  arithmetic 
expression  composed  of  the  four  arithmetic  operations  (+,-,*,/),  left 
and  right  parentheses,  and  atoms  which  are  constants  or  variables. 

Definition  8         A  loop  is  denoted  by 

L  =  (I1  «-  Nr  I2  *-   N2,  ...,  Id  <-   Nd)(S1,  S2,  ...,  Sg) 

or  =  (1^  I2,  ...,  Id)(S1,  S2,  ...,  Sg) 
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where  I,  is  called  loop  Index,  N,  is  an  ordered  set  and  called  index  set, 

and  S,  is  called  body  statement  which  may  be  an  assignment  statement  or 

another  loop.   We  will  call  it  a  basic  loop  if  there  is  no  S,  which  is 

a  loop. 

For  example,  a  basic  loop  like 

L:    DO  S   I  =  1,  10,  2 

DO  S2   I  =  1,  10,  1 

S±:        Y(I1,I2)  =  A(I1,I2)  +  B 

S2:    Z(IltI2)  =  C(I1,I2)  +  D 

is  written  as  L  =  (I  +■  1^,  I2  «-  N2)(S1,S2)  or  simply  L  =  (Ii»I2^  (Si»S2^  ' 
where  N-  -  {1,3,5,7,9}  and  N  =  {1,2,3, ... ,10}  with  InJ  =  5  and 
|n2|  =  10. 

5.2  Speedup  Characteristics  of  Program  Loops 

In  this  section,  we  will  characterize  program  loops  according 
to  their  possible  speedups  over  serial  computation  time  T..  . 

Obviously,  a  loop  is  equivalent  to  a  sequence  of  assignment 
statements  according  to  the  original  execution  order  defined  by  the  index 
set.   In  order  to  formalize  the  problem,  let  the  i-th  statement  of  such 
sequence  be 

x.  =  E.  ,    1  <  i  <  n 
l    l       —   — 

where  n  is  the  total  number  of  assignment  statements  executed  by  the  loop. 
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For  instance,  given  a  loop  L  as  that  in  the  preceding  section,  we  have 

L:   x   =  Y(l,l)  =  A(l,l)  +  B 
x2   -  Z(l,l)  =  C(l,l)  +  D 


xA9  =  Y(9,10)  =  A(9,10)  +  B 
x5Q  =  Z(9,10)  =  C(9,10)  +  D  . 

If  none  of  the  expressions  E.,  1  _<  i  _<  n,  contains  an  atom 

x.    for  m  >  0,  then  there  is  no  dependence  relation  between  any  pair 

i-m 

of  statements.   Thus,  all  x.  =  E.,  1  _<  i  _<  n,  can  be  computed  in  parallel. 

We  will  call  this  set  of  loops  R.   For  example,  the  following  loop  which 
performs  matrix  addition  and  scalar  multiplication, 

DO  5  I  =  1,  10 

DO  5  J  =  1,  10 

Y(I,J)  =  A(I,J)  +  B(I,J) 

5    Z(I,J)  =  C(I,J)*D  , 

has  this  property. 

The  total  time  required  by  any  LeR  is,  by  Brent's  argument  [25], 

Tp  <  41og2  e 

where 

e  =  max  {e.}  where  e.  is  the  number  of  atoms  in  E., 

1  <  i  <  n 
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Hence,  we  have  for  LeR 


S   "  tj~ —  -  OCT,)  .  (27) 

p  —  41og0e      r 


as 


Now,  let  us  study  a  slightly  more  complicated  loop  such 
one  that  performs  vector  inner-product 

DO  5  I  =  1,  10 
5   T  =  T  +  A(I)*B(I)  . 

Loops  of  this  kind  have  the  property  that  some  of  the  expressions  E  , 

1  <  i  <  n,  contain  a  linear  combination  of  x.     for  m.  >  0.   We 
—   —  l— m.       1 

will  call  this  set  of  loops  LR.   For  any  LeLR,  if  we  compute  simultaneously 

all  subexpressions  in  E. ,  1  j<  i  j<  n,  which  do  not  depend  on  any  computed 

value  in  the  loop,  i.e.,  x.  for  1  _<  i  <^  n,  then  the  resultant  loop  can 

be  treated  as  R<n,m>  system  where  m  <  n  is  the  maximum  of  m.  for  all  i. 

This  can  be  illustrated  by  the  above  example  of  vector  inner-product  in 
which  we  can  evaluate  A(I)*B(I)  for  1  <_  I  <_   10  in  parallel.   As  a  result, 
we  obtain  all  required  coefficients  to  compute  an  R<10,1>  system 
represented  by  the  original  loop. 

The  total  computation  time  of  such  loops  is,  therefore,  pre- 
processing time  of  coefficients  which  is  41og_e  time  steps,  plus  the 
time  to  solve  an  R<n,m>  system  which  is  stated  in  Theorem  2.   Since 
n  _<  T_  £  ne,  we  have  a  speedup  for  this  subset  of  LR 

T, 


S  _> ■ 5 ~~ 

P  "'  (log2m  +  2)log2n-l/2(log2m  +  log2m)  +  4log2e 

=  0(— i— )  •  <28> 

log2T1 
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Next,  consider  the  subset  of  loops  which  has  m  =  n.   For 
example,  given  an  upper  triangular  matrix  A,  to  solve  Ax  =  b  by  the 
traditional  back-substitution  method  we  may  write  a  loop  like 

DO   5   I  =  10,  1,  -1 
X(I)  =  B(I)/A(I,I) 
DO   5   J  =  I  +  1,  10,  1 
5    X(I)  =  X(I)  -  (A(I,J)/A(I,I))*X(J)  . 

In  this  example,  if  we  preprocess  B(I)/A(I,I)  for  all  I,  and  A(I,J)/ 
A(I,I)  for  all  I, J,  we  obtain  an  R<n>  system.   Since  m  =  n,  this  is  the 
worst-case  loop  of  LR.   Hence,  we  can  say  that  the  computation  time  of 
any  LeLR  is  less  than  41og„e  plus  the  time  stated  in  Theorem  1.   There- 
fore, we  have  for  any  LeLR 


S   > 


l/21og2n+3/21og2n+41og2e 


=  0( 


lo*2Tl 


)  •   (29) 


Next,  we  study  a  simple  looking,  but  more  complicated  loop: 

DO   5   I  =  1,  10 
5    X(L)  =  (X(I-l)  +  A/X(I-l))/2  . 

This  is  a  familiar  iterative  program  for  approximating  vA   .   This  class 
of  loops  has  the  property  that  some  of  the  expressions  E.,  1  <^  i  <^  n,  are 

non-linear  combinations  of  x.     for  m.  >  0.   We  denote  this  set  by  LR. 

l-m.       i 

i 

LU  decomposition  algorithms  in  standard  numerical  programs  are  another 
well-known  example  in  this  class. 

Muraoka  shows  that  by  using  forward  substitution  any  loop  with 
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E.  being  a  d-th  degree  polynomial  of  x._i»  d  >  1,  can  be  speeded  up  at 

most  by  a  constant  factor.   Kung  also  shows  this  later  by  forward 
substitution  techniques  [26].   However,  it  remains  an  open  question 
whether  there  is  some  algorithm  besides  forward  substitution  which  may 
speed  up  this  class  of  loops  by  more  than  a  constant.   At  the  present 
time,  we  may  content  ourselves  that  there  could  be  some  LeLR  which  can 
only  have  a  constant  speedup. 

From  the  above  analysis,  we  can  obtain  a  quick  estimate  of  the 
possible  speedup  of  a  given  program  loop  without  any  detailed  knowledge 
of  subscript  expressions  and  index  sets.   This  is  stated  as  follows. 
Given  a  loop  L,  let  us  define  recursively  (as  S  may  also  be  a  loop): 

IN(L)  =  U  INCS^  ,    1  <_   i  <_   s 

OUT(L)  =  U  0UT(S±),    1  <_   i  <_   s  , 

where  IN(S.)  and  0UT(S.)  are  the  set  of  variable  names  in  the  left-  and 

right-hand  sides,  respectively.   And  we  denote  Q(L)  =  IN(L)  (\   OUT(L). 

Theorem  7         For  any  L  and  a  sufficient  number  of  processors,  we 
have 

1)  if  Q(L)  =  <j>  ,  then  the  minimum  speedup  is 

Tl 
S  =  ..     =  0(T.)  . 
p   41og2e      r 

2)  if  Q(L)  ^  <f)  and  each  arithmetic  expression  on  the  right- 
hand  side  is  a  linear  combination  of  all  XeQ(L),  then  the 
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minimum  speedup  is 

T  T 

S   P  5 ~ -  0( M  • 

P   l/21og2n+3/21og2n+41og2e      lo82Tl 


where         e  =  maximum  number  of  atoms  in  assignment  statements, 
n  =  total  number  of  assignment  statements  executed. 

Proof :         It  is  easy  to  see  that  the  set  of  loops  which  satisfy 
Condition  (1)  (or  (2))  is  a  subset  of  R  (or  LR) .   Thus,  the  theorem 
follows  directly  from  Equations  (28)  and  (29). 

Q.E.D. 

The  conditions  stated  in  Theorem  7  are  over-restricted  if 
subscripts  are  analyzed.   For  example,  a  loop  like  L  =  (I)(S:A(I) 
=  A(I)  +  1)  has  Q(L)  3*  $  but  LeR.   We  will  discuss  more  refined  bounds 
in  section  4  based  on  such  information.   However,  we  can  use  this  as  a 
crude  approximation  about  the  average  speedup  of  ordinary  program  loops. 
Knuth  and  his  group  measured  7933  DO  loops  and  report  that  about  87%  of 
them  have  no  more  than  5  statements,  and  91%  of  them  have  within  3 
levels  of  nesting  [6].   Let  us  assume  typically  every  loop  has  3  assign- 
ment statements  and  forms  a  double  nesting,  i.e., 

L  =  (I1  *"  Nl'  X2  *  N2)(S1'  V  V  " 

It  is  also  reasonable  to  assume  e  =  3  and  | N1 |  =  |N_|  =  10.   Hence, 

n  =  3x10x10  =  300  and  T  =  300x2  =  600.   Our  experiences  in  measurements 

also  suggest  that  most  of  the  loops  satisfy  Condition  (2)  of  Theorem  7. 
Therefore,  we  can  say  that  typical  loops  in  ordinary  programs  can  be 
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speeded  up  by  a  factor  of  10,  and  in  many  practical  cases  up  to 

T 
log 


100  (~  -z — )   and  more   (~  T-),  if  a  sufficient  number  of  processors 

log2T1  1 


are  available.  We  have  observed  this  phenomena  in  our  measurements  of 
about  100  randomly  selected  programs  [7  ] . 

In  summary,  we  have  studied  loop  computations  theoretically 
from  their  expanded  forms.   By  using  the  results  of  Chapter  1,  we  are 
able  to  classify  all  loops  in  terms  of  their  possible  speedups  over  serial 
computation  time  T_  ,  i.e., 

Tl 
S  = r      for  0  <_   i  <_   2  ,  (30) 

P   a.dog^)1 

and  S  =  c  . 

P 

We  may  call  it  Type  i  loop,  0  <_   i  _<  2,  if  its  speedup  has  the  form  of 
Equation  (30)  or  Type  3  loop  if  S  =  c.  We  have  shown  examples  of  these 

4  types  of  loops.   Obviously,  the  computational  complexity  of  a  Type  i 
loop  is  higher  than  Type  j  loop  if  i  >  j .   Although  we  only  discuss  all 
loops  without  jump/logic  or  other  types  of  body  statements,  the  basic 
concepts  presented  here  can  be  easily  generalized  to  any  loop  structures, 
and  further  to  the  complexity  problems  of  different  algorithms. 

5.3  Exploitation  of  Parallelism  in  Program  Loops 

In  the  preceding  section,  a  theoretical  speedup  spectrum  has 
been  established  for  various  kinds  of  loops,  i.e.,  from  the  simplest 
Type  0  loop  to  the  most  complicated  Type  3  loop.   In  this  section,  we 
present  a  general  algorithm  which  may  be  used  in  the  semantic  analysis 
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portion  of  a  compiler  to  detect  inter-  and  intra-statement  parallelism 
in  program  loops  and  achieve  higher  speedups. 

The  central  idea  of  this  algorithm  is  to  distribute  (decompose) 
a  given  Type  i  loop  into  a  set  of  smaller  size  loops  of  Type  j , 
0  <_  j  _<  i,  which  upon  execution  give  the  equivalent  results  of  the 
original  one.   This  is  essentially  to  reduce  a.  in 

s  -    * 


P   ajdogjT/ 

(and  hence  increase  speedup)  as  much  as  possible. 

We  will  describe  the  algorithm  informally  by  using  examples. 
The  idea  of  loop  distribution  was  introduced  by  Muraoka  [10],  and 
later  was  implemented  in  our  Fortran  program  analyzer  to  measure  potential 
parallelism  in  ordinary  programs  [7],  [8].   Should  it  be  used  in  an 
actual  compiler,  the  details  will  depend  largely  on  different  levels  of 

sophistication  in  implementing  them  as  discussed  later.   Hence,  in  what 
follows,  we  will  address  only  the  basic  concepts  which  would  be  required 
in  an  idealized  loop  distribution  algorithm. 

Distribution  Algorithm 
Given  a  loop 

L  =  (L ,  lo»  *  *  *  »   H  '  i »  o *    • • • >  "Q'  * 

Step  1         By  analyzing  subscript  expressions  and  indexing  patterns, 
construct  a  dependence  graph  D  with  respect  to  loop  index  I   for 

1  <.  u  5.  d*   In  D  ,  there  are  s  nodes  representing  S.  for  1<_  j  <_  s.   An 
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arc  from  S,  to  S  indicates  that,  for  fixed  values  of  I- ,  I-,  ..., 
I   ,  an  atom  of  S  during  certain  iterations  has  been  updated  by  the 
output  variable  of  S,  during  an  earlier  (or  the  same  if  k  <  &)  iteration 
generated  by  the  nest  I  ,  I   1 ,  ...,  I,  . 


Example  5 


Given  a  loop 


DO  S2   I-  -  1,  10 


DO  S2   I2  =  1,  10 


DO  S2   I  =  1,  10 


A(I1,I2,I3)  =  B(I1-1,I2,I3)*C(I1,I2)  +  D*E 


B(IrI2,I3)  =  A(I1,I2-1,I3)*F(I2,I3) 


we  have 


Dr 


D, 


D3: 


Example  6 


Given  a  loop 


L: 


V 
V 

S3  = 
S5: 


DO  S  I  -  1,  10 
A(I)  -  A(I-l)  +  F(I) 
B(I)  =  A(I)  +  B(I+1) 
C(I)  =  D(I-2)  +  H(I) 
D(I)  =  C(I-l)  +  B(I) 
E(I)  =  D(I)  +  N(I+3) 
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we  have 


Step  2 


On  the  dependence  graph  D  for  1  <  u  <  d,  define  a 


partition  tt  of  {S  ,  S 


,  S  }   in  such  a  way  that  S   and  S   are  in 


the  same  subset  if  and  only  if  there  is  a  chain  of  arcs  from  S.  to  S 

k.     a 

and  vice  versa.   Elements  of  each  subset  are  ordered  by  their  subscripts, 

For  Example  5,  we  have 


TT;L  =  {(S1,S2)>,  7T2  =  {(S1),(S2)},  TT3  =  {(S1),(S2)}  , 


and  for  Example  6,  we  have 


7r1  =  {(s1),(s2),(s3,s4),(s5)} 


Step  3  On  the  partition  it  =  {if  ,,  i  „,  ...}   for  1  <  u  <  d, 
u     ul   u2  —   — 

define  a  partial  ordering  relation  y   in  such  a  way  that  tt  .  y   t\    . 

(reflexive),  and  for  i  ^  i,  tt  .  v  tt  .   if  there  is  an  arc  in  D  from  some 

Ul     uj  u 


element  of  tt  .  to  some  element  of  tt  . .   The  v  relation  is  also 
ux  uj       ' 
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antisymmetric  and  transitive. 

By  neglecting  those  relations  implied  by  reflexivity  and 
transitivity,  we  have  for  Example  5   (S.)  y    (S,,)  in  i^-   And  for 

Example  6,  we  have  (S^  y    (S2) ,  (S2>  y    (S3,S4),  (S3,S4)  y    (S5>  in  ir^ 


Step  4         Let  the  (d-u+1)  innermost  nest  of  L  be  L  for 


1  <  u  <  d,  i.e. , 


"  ttri2 W  "Wi V(srs2 V1 

■  (I1-Z2 W*"'  • 

Then  replace  L  according  to  tt  with  a  set  of  loops  {(I)(tt  .),(I)(tt  „)  , 

...}  where  (I)  =  (1^1^,...,:^). 

The  condition  for  the  partial  ordering  relation  y   insures  that 

data  are  updated  before  being  used.   Hence,  any  execution  order  of  the 

set  of  loops  which  replaces  L  will  be  valid  as  long  as  this  relation  is 

not  violated.   Hence,  for  fixed  values  of  I,  ,  I„,  ...,  I  , ,  if  it  .  v 

12        u-*l      ui 

tt    then  loop  (I)  (tt  .)  must  be  evaluated  before  (I)  (tt  .),  otherwise 
uj  r      ui  uj 

they  may  be  computed  in  parallel.  (In  general,  we  can  also  use  statement 

substitution  to  remove  this  relation  between  some  of  the  distributed 

loops.   This  will  be  discussed  in  the  next  section). 

For  instance,  in  Example  5,  we  have  now  three  possible  ways  in 

the  evaluation  of  L.   Besides  the  first  one  which  happens  to  be  L,  the 
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other  two  are 


and 


do  s   r-  *»l,  10 

DO   S    I   =  1,  10 

DO  S  I3  -  1,  10 
Sx:    A(I1,I2,I3)  =  BClj-1,1^, I3)*C(I1,I2)  +  D*E 

DO   S2   I2  =  1,  10 

DO  S2  I  =  1,  10 
S2:    B(I1,I2,I3)  =  A(I1,I2-1,I3)*F(I2,I3)  , 


DO   S    I  =  1,  10 


DO   S3   I  =  1,  10 


DO   S1   I  =  1,  10  DO   S2   I   =  1,  10 

S±:        A(IV12,13)   =  •..       S2:    BCI^I^I^  =  ... 


S3:    CONTINUE 

in  which  the  two  innermost  loops  are  computed  in  parallel. 

Similarly,  one  can  evaluate  the  loop  in  Example  6  as  four 
sequential  loops: 
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DO  S   I  =  1,  10 


S1:    A(I)  =  A(I-l)  +  F(I) 


DO   S2   I  =  1,  10 


S2:    B(I)  =  A(I)  +  B(I+1) 


DO  S,   I  =  1,  10 


S3:    C(I)  =  D(I-2)  +  H(I) 


S4:    D(I)  =  C(I-l)  +  B(I) 


DO  S5   I  =  1,  10 


S  :    E(I)  =  D(I)  +  N(I+3) 


Ideally,  we  would  like  to  choose  the  one  which  gives  the 
fastest  results  for  the  original  loop  L  under  the  given  machine  con- 
figurations.  For  illustration,  let  us  assume  an  arbitrary  number  of 
processors  are  available  and  statement  substitution  is  not  allowed. 
By  using  Example  5  again,  the  decomposition  according  to  D_  will  yield 

two  Type  0  loops  inside  the  nested  loops  I  and  I~,  and  each  of  them  can 

be  evaluated  in  parallel.   Thus, 

T  =  10xl0xmax{2,l}  =  200  . 
P 

We  may  also  distribute  it  by  D?  which  gives  us  two  sequential  type  0 

loops  inside  loop  I,  and 

T  =  (2+1)  x  10  =  30  ; 
P 
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or  by  D  which  may  result  in  an  R<2000,199>  system  after  preprocessing 

of  coefficients  as  described  in  the  previous  section.   Thus,  we  have 
for  this  case 

T  =  (log2199+2)  log22000  -  l/2(log2199+log2199)  +  log24 

=  76  . 

Clearly,  we  will  choose  the  second  decomposition  to  obtain  a  speedup  of 

s  =  li  .  4000 
p   T     30    ■LJJ  ' 
P 

Similarly,  the  loop  in  Example  6  will  be  distributed  into  a 
set  of  four  sequential  loops  (I) (S  )  ,  (I)  (S  ) ,  (I)(S  ,S.)  and  (I)  (S  ) 

which  are  Type  1  (m=l) ,  Type  0,  Type  1  (m=3)  and  Type  0  loops,  re- 
spectively.  Obviously,  both  loops  (I)  (S  )  and  (I)  (S-)  need  t..  =  t„  =  1 

Let  preprocessing  time  of  coefficients  of  loops  (I)(S1)  and  (I)(S_,S.) 

be  one  time  step,  then  the  computation  times  of  both  loops  are 

t3  =  21og210  =  8  , 

and  t4  =  (log23+2)  log^O  -  l/2(log23+log23)  =  17  . 

Thus ,  the  overall 

4 
T  =  I   t.  +  1  =  28  , 
P   1=1  X 

while         T  =  5x10  =  50  . 

So  that,  we  have  for  Example  6 

T 
s   =-ia  10 

p   T    28   L    ' 
P 


Ill 


It  is  interesting  to  note  that  both  loops  in  Example  5  and 
Example  6  are  Type  1  by  their  very  nature.   However,  the  speedup  of  the 
second  loop  is  far  smaller  than  that  of  the  first  one.   This  is  due  to 
the  big  difference  of  problem  sizes  (T  ratio  =  4000/50) .   If  we  in- 
crease the  number  of  iterations  of  the  second  loop  from  10  to  800,  then 
both  loops  will  have  the  same  size  but  now  we  can  obtain  a  speedup 

S   a  .-      *  62,  which  is  comparable  to  the  first  loop's  speedup. 

We  should  also  notice  that,  without  distribution,  the  loop  in 
Example  6  can  be  computed  as  an  R<50,9>  system.   It  yields  about  the 
same  speedup  (T  ~  27)  as  that  obtained  by  distribution,  but  requires 

far  more  processors  (see  Theorem  2)  to  achieve  the  same  speed.   This 
demonstrates  that  loop  distribution  may  not  give  better  speedup  but  it 
could  improve  the  efficiency  in  the  use  of  processors. 

In  summary,  the  distribution  algorithm  introduced  in  this 
section,  resembling  the  distribution  algorithm  for  the  reduction  of  tree- 
height  of  any  given  arithmetic  expression  [  8 ] ,  [10  ]  >  may  introduce  more 
parallelism  into  a  program  loop  than  that  obtained  from  an  undistributed 
one.   The  simpler  the  loop  structure  is,  the  more  parallelism  can  be 
extracted  by  the  loop  distribution  techniques.   Since  most  loops  in 
ordinary  programs  are  quite  simple,  we  believe  a  practical  loop  distribu- 
tion algorithm  with  reasonable  complexity  can  be  justified  to  be  built 
into  a  real  compiler  and  generate  efficient  object  codes  from  ordinary 
programs  for  future  parallel  machines. 
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5.4  Time  Bounds  of  Program  Loops 

In  section  2,  we  have  given  two  crude  bounds  on  computation 
time  without  any  subscript  information.   As  stated  before,  by  performing 
subscript  analysis  we  can  obtain  more  refined  bounds.   In  this  section, 
we  will  discuss  some  of  these  bounds  in  terms  of  certain  measurable 
loop  parameters  and  the  connectivity  of  its  dependence  graph.   The 
graph  is  the  product  of  subscript  analysis  or  a  computation  pattern 
that  we  know  a  priori. 

For  simplicity,  we  shall  concentrate  on  basic  loops  and  it  is 
not  difficult  to  generalize  the  discussion  to  non-basic  ones.   All 
parameters  used  are  listed  as  follows. 

Definition  9  For  a  given  basic  loop  L,  we  define 


1)  k  =  tt   N.   where  In.  I  is  the  number  of 

3-1  J  j 

iterations  of  loop  index  I .  and  d  is  the  number 

3 

of  nested  loops. 

2)  s  =  the  number  of  assignment  statements  in  L. 

3)  E={e.|l<i<s}     where  e.    is   the  number   of 

J1   ~   -  J 

atoms  in  S . .   The  maximum  e.  is  denoted  by  e. 
3  3 

4)  Q={q..|l£i»jf.s,   1  <  h£e.}  where 

q.  .      e  {<f>,1.2, .  . .  ,k)  is  the  number  of  iterations 

over  which  the  h-th  dependence  line  in  D- 

passes  from  S.  to  S..   If  there  is  only  one  such 
i     3 
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line,  we  omit  the  superscript  (h) .   The  maximum 

q..   is  denoted  by  q. 
ij 

With  these  parameters,  any  basic  loop  can  be  represented  by  a 
four-tuple,  i.e.,  L(k,s,E,Q).   For  instance,  the  loop  in  Example  5  is 
L(1000,2,  {4,2},  {q  =10,  q  =100})  and  in  Example  6  is  L(10,5,  {2,2,2,2,2} 

{qll=1'  q12=0'  q24=°'  q34=1'  q43=2'  q45=0})' 

Now  we  will  present  various  bounds  based  on  this  loop  informa- 
tion.  The  simplest  case  is  a  loop  whose  dependence  graph  is  empty,  i.e., 
Q=cf>.   This  implies  that  LeR  and  hence  from  Equation  (27)  we  obtain 

Theorem  8  Given  any  L(k,s,E,Q),   if  Q=cf) 

then     T  <_   41og_e. 

For  the  set  of  loops  with  non-empty  graphs,  we  can  easily 
distinguish  two  cases,  i.e.,  acyclic  and  cyclic  graphs.   The  acyclic  graphs 
can  be  readily  recognized  as  either  tree-connected  or  non-tree-connected. 
Formally,  we  define  these  as  follows. 

Definition  10  An  acyclic  dependence  graph  is  a  dependence  graph 

of  s  nodes,  S.  for  1  <^   i  <^  s,  with  no  pair  (S.,S.)  such  that  there 

exists  a  chain  of  arcs  (dependence  lines)  from  S.  to  S   and  vice  versa. 

If  there  is  only  one  chain  of  arcs  between  any  pair  of  nodes,  we  call  it 
tree-connected .   Figure  14  shows  all  these  cases. 


114 


(a)   Acyclic 
& 
Tree- connected 


(b)   Acyclic 
& 
Non- tree-connected 


(c)   Cyclic 


Figure  14.   Connectivity  of  Dependence  Graph 

For  acyclic  graphs,  it  is  easy  to  demonstrate  that  we  can 

perform  statement  substitution  between  any  pair  of  nodes  which  have  a 

dependence  relation:   that  is,  we  substitute  for  each  left-hand-side 

variable  of  S.  on  the  right-hand  side  of  S.,  which  is  the  cause  of  a 
i  3 

dependence  relation,  the  corresponding  arithmetic  expression  on  the 
right-hand  side  of  S.  with  all  subscript  expressions  shifted  in  ac- 
cordance with  the  q. .   representing  this  relation. 
For  example,  given  a  loop 


sr 


DO   S,   I  =  1,  10 
A(I)  =  E(I-l)  +  F(I) 
B(I)  =  A(I)  +  A(I-l)  , 
C(I)  =  B(I-l)  +  G(I) 
D(I)  =  A(I-2)  +  H(I)  , 


D, 
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we  have  Q  =  {q|2  ,  q{2  »  q23'  q14*  "  *°'  1*  1*  2*  *   By  *PP1yin8  thij 
procedure,  a  set  of  four  independent  loops  results: 


DO  S|  I  -  1,  10 


Sj:   A(I)  =  E(I-l)  +  F(I) 


DO   S^   I  =  1,  10 
S^:    B(I)  =  E(I-l)  +  F(I)  +  E(I-2)  +  F(I-l) 


DO  S'   I  =  1,  10 
S^:    C(I)  =  E(I-2)  +  F(I-l)  +  E(I-3)  +  F(I-2)  +  G(I) 

DO  S'   I  =  1,  10 
S^:    D(I)  =  E(I-3)  +  F(I-2)  +  H(I)  . 

And  all  of  them  can  be  computed  in  parallel  as  the  dependence  relations 
among  them  have  been  removed  by  this  process . 

Evidently,  T  and  p  of  each  resultant  statement  S ' ,  1  j<  j  <_  s, 

can  be  bounded  by  the  number  of  atoms  present  in  its  final  form.   The 
growth  of  atoms  in  each  resultant  statement  can  be  easily  observed  from 
the  fact  that  if  S  has  y.,    chains  of  arcs  connected  to  S,,  then  S 

will  contribute  Yj.(eJ~l)  atoms  to  S|  .   Hence,  in  general  the  number  of 

atoms  in  S|,  1  <  j  <  s,  will  be  the  summation  of  e,  over  the  contributions 
J    -   -  j 

from  all  other  nodes  which  have  at  least  a  chain  of  arcs  connected  to  S  . 
This  is  shown  in  the  following  example  in  which  S!  contains 
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e  +  (e  -1)  +  2(e2-l)  +  (e^-l)  atoms 


© 


(erl)  X 


Now,  we  can  state  a  general  theorem  for  the  class  of  loops  with  acyclic 
dependence  graphs. 


Theorem  9 


Given  L(k,s,E,Q)  with  acyclic  dependence  graph.   If 


statement  substitution  is  allowed,  then  we  have 

1)  for  a  loop  with  a  tree-connected  graph  T  <_   41og9(se). 

2)  for  a  loop  with  a  non-tree-connected  graph  T  <_  41og~(s'e) 


where 


Z-l 


s-1 


s"  =  (s-£+2)r    <_   2r    <   r   , 

&     =  maximum  length  (number  of  elements)  of  the  chains 

of  partially  ordered  set  n        as  defined  in  section  3 

(31  £  si    . 
r  =  maximum  number  of  chains  from  any  element  of  ir1  . 


Proof: 


For  the  tree-connected  case ,  any  S . ,  1  <_  j  <_  s ,  can  have 


at  most  one  chain  of  arcs  from  S.,  1  <  j  <  s  and  i  ^  j.   Hence,  let  e1 

1    ~   ~  J 
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be  the  number  of  atoms  of  S',  then 


e!  <^  e.  +  Z  (e_.-l)  5  se 
1=1 


1=1  X 


Since  all  Si,  1  _f_  j  £.  S,  can  be  evaluated  in  parallel,  part  (a)  follows 

directly  from  Brent's  argument. 

For  the  non-tree-connected  case,  there  are  at  most  r  chains 

(r  >_  2)  of  arcs  coming  out  of  one  node  and  the  maximum  length  of  these 

chains  is  l.      Thus,  the  maximum  contribution  by  S .  to  S!  is 

i     J 

£-1      SL-1 
r   e.(j<  r   e)  .   Let  all  (s-A+1)  nodes  contribute  in  the  same  manner, 

e!  <  e.  +  (s-£+l)rJ!'"1e  +  (r£_2+r£~3+. . .+r)e 
J  -  J 

<  (s-£+2)/-1e  <  2rS_1e  <  rSe  . 

Therefore,  part  (b)  is  also  proved. 

Q.E.D. 

Next,  we  study  the  case  when  statement  substitution  is  not 
allowed.   In  this  case,  we  can  prove  that 

Theorem  10  Given  any  L(k,s,E,Q)  with  acyclic  dependence  graph. 

If  statement  substitution  is  not  allowed,  then  we  have 


T  <  4£log_e  <  4slog_e  . 
p  —      2  -     °2 
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Proof:  Since  tt  has  the  maximum  length  of  the  chains  i,  then 

the  elements  in  it  can  be  partitioned  into  £  disjoint  antichains  (subsets 

of  elements),  and  in  each  antichain  there  is  no  relation  among  any  pair 
of  its  elements.   This  is  a  dual  form  of  Dilworth's  Theorem  [27] •   Hence, 
all  decomposed  loops,  each  of  which  represented  by  one  element  of  a 
particular  antichain,  can  be  computed  in  parallel.   Since  the  graph  is 
acyclic,  each  of  such  loops  has  Q=<f>  with  no  more  than  one  statement. 
Thus,  by  Theorem  8,  the  computation  time  of  all  loops  represented  in  one 
antichain  is  no  more  than  41og~e,  and  for  I   antichains  in  total  we  have 

T  £  4Uog  e. 

Q.E.D. 

Theorem  9  and  Theorem  10  give  two  extreme  bounds  with  and  with- 
out statement  substitution.   Needless  to  say,  that  any  combination  of 
partial  substitution  and  partial  distribution  will  yield  time  bounds 
between  these  two  results. 

Also  note  that  Theorems  8-10  can  be  applied  to  bound  the  compu- 
tation time  of  a  block  of  assignment  statements,  as  it  is  a  special  case 
of  a  loop  with  k=l,  q..   =  {^,0}   for  all  h  and  i  <  j,  and  q  .  =  $ 

for  i  >^  j  . 

Finally,  let  us  examine  those  loops  with  cyclic  dependence 
graphs.   Recall  from  the  previous  section  that  any  element  tt_  .  of  tt- 

with  multiple  statements  actually  forms  a  maximal  cycle,  i.e.,  it  con- 
tains all  smaller  cycles  formed  by  its  elements.   Those  statements 
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represented  in  a  maximal  cycle  are  coupled  together  by  dependence 
relations  and  no  statement  substitution  is  possible.   By  treating  such 
a  cycle  as  a  single  loop,  we  can  compute  it  as  a  recurrence  system  if 
it  is  linear.  Hence,  we  can  use  all  results  in  the  previous  chapters 
to  bound  the  time  and  processors  for  these  maximal  cycles.   In  particular, 
we  can  state  a  general  theorem  for  LeLR  as  follows. 

Theorem  11  Given  any  L(k,s,E,Q)  with  cyclic  dependence  graph, 

then  if  LeLR,  we  have 

T  £  [(log2(cq+c-l)  +  2)  log2(ck)].mina,t)  +  4Uog2e    (31) 


<  1/2  log^Csk)  +  3/21og2(sk)  +  41og2e  (32) 


where        c  =  maximum  number  of  nodes  in  one  maximal  cycle  of  D  , 

which  is  equivalent  to  the  maximum  number  of  statements 
represented  by  one  element  of  the  partially  ordered 
set  it.  (1  <_  c  <_  s) . 

t  =  total  number  of  maximal  cycles  in  D..  . 

Proof:         We  first  show  it  for  c^s.   In  this  case,  as  discussed  in 
the  proof  of  Theorem  10,  the  elements  of  tt.  can  be  partitioned  in  I   dis- 
joint antichains.   Any  decomposed  loop,  represented  by  one  element  of  a 
particular  antichain,  contains  no  more  than  c  statements.   Hence,  the 
maximum  computation  time  of  all  loops  represented  in  one  antichain  is 
equivalent  to  that  of  an  R<n,m>  system  with  at  most 
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m  =  c(q-l)  +  (c-1)  +  c  =  c(q+l)  -  1  , 

plus  the  maximum  time  for  preprocessing  coefficients  (41og_e) .   Since 

there  are  at  most  min(£,t)  maximal  cycles  in  a  chain,  Equation  (31) 
is  proved  by  Theorem  2.   For  the  case  c=s,  Equation  (32)  directly 
follows  from  Theorem  1. 


Q.E.D, 


In  summary,  we  have  given  Theorems  8-11  to  estimate  time  bounds 
(and  hence  speedups  since  T   ~  ske  for  all  cases)  for  a  large  class  of 

program  loops  if  their  computation  patterns  are  known  or  reconstructed 
by  subscript  analysis.   These  bounds  are  based  on  some  gross  parameters 
such  as  (s,k,e,q,£,c, t)  which  we  used  here.   Within  a  specific  class  of 
problems  and  their  computational  programs,  e.g.,  eigenvalue  programs  or 
linear  system  solvers,  users  usually  have  a  very  good  idea  of  average 
computation  patterns  and  typical  values  of  these  parameters  inside  their 
program  loops.   With  these  kinds  of  analytical  bounds,  they  can  have  a 
general  idea  about  the  ultimate  speedup  range  they  can  expect  from  their 
programs  running  on  a  multiprocessing  system.   This  may  also  help  the 
designer  of  computers  for  a  given  class  of  computations  in  knowing  how 
well  they  are  doing  in  comparison  to  these  possible  speedups  with  an 
arbitrary  number  of  processors. 

Although  we  do  not  give  time  bounds  for  a  fixed  number  of 
processors,  due  to  their  infinite  number  of  variations,  it  is  conceivable 
that  by  using  some  tables  like  that  in  Chapter  2  or  eventually  some 
analytically  derived  formulas,  we  can  also  obtain  some  of  these  results 
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for  practical  usages.   Similarly,  a  careful  redefinition  of  parameters 
and  detailed  studies  of  loop  structures  should  be  able  to  generalize 
our  discussions  here  to  non-basic  loops,  and  further  to  loops  with  jump/ 
logic  control  informations. 

5.5  Practical  Implications 

As  discussed  in  sections  3  and  4,  a  good  compiler  algorithm  is 
desirable  to  decompose  a  given  loop  into  a  set  of  loops  which  can  be 
evaluated  efficiently  for  a  fixed  machine.   The  complexity  of  loop 
distribution  algorithms  will  be  decided  by  the  levels  of  optimization  we 
want  and  the  given  machine  structure  we  have. 

For  instance,  we  have  various  ways  of  constructing  the 
dependence  graph  from  a  loop: 

1)  only  analyze  the  input  and  output  variable  names  of  each 
assignment  statement  without  checking  of  subscript  expressions  at  all,  or 

2)  Subscript  analysis  is  performed  but  renaming  procedure  is 
not  allowed.   In  this  case,  for  a  loop  like 


DO   S2   1-1,  N 
S1:   A(I)  =  B(I)  +  B(I-l) 
S2:    B(I)  =  C(I)  +  D(I+1)  , 

we  will  have  a  dependence  line  from  S«  to  S. ,  and  an  artificial  line 
from  S.  to  S.  even  though  S-  never  updates  S_  in  the  loop;  otherwise,  we 
may  distribute  and  evaluate  (I)  (S.)  and  (I)  (S.. )  sequentially.   The  latter 
will  obviously  result  in  erroneous  solutions  as  the  first  atom  B(I)  in 


122 


S  needs  old  values  instead  of  the  updated  ones  generated  by  this  new 

execution  order.  Hence,  the  artificial  line  inserted  will  lock  up  these 
two  statements  from  being  distributed.  This  technique  is  used  in  TI  ASC 
Fortran  compiler  [24]-  On  the  other  hand,  if  we  are  allowed  to  rename  B 
in  S  and  the  second  B  in  S1  (as  well  as  B(0)  outside  the  loop)  to  a  new 

variable  name,  e.g.,  B',  then  the  distribution  is  possible.   Techniques 
to  create  these  new  storages  is  reported  in  [10]. 

We  also  mentioned  that  statement  substitution  between  two 
decomposed  loops  which  have  dependence  relation  may  enable  them  to  be 
evaluated  in  parallel.   However,  if  the  machine  structure  does  not  allow 
more  than  one  array  fetched  at  the  same  time,  then  the  effect  on  speedup 
of  statement  substitution  will  be  degraded  or  not  preferred  at  all. 

Finally,  let  us  look  at  one  decomposed  loop,  e.g., 

DO   5   I  =  1,  N 
DO   5  J  =  1,  N 
5   A(I,J)  =  A(I,J-1)  +  A(I-1,J)  , 

which  has  two-dimensional  recurrence  relations  and  may  be  treated  as  a 

2 
very  sparse  R<n,m>  system  with  n=N  and  m=N.   Although  we  can  solve  it 

by  the  previous  general  algorithms,  for  a  system  with  very  small  p,  the 

wave-front  method  in  [10],  [23]  may  also  be  attractive  because  of  its 

simplicity.   On  the  contrary,  if  the  above  statement  is  changed  to 

5   A(I,J)  =  A(I,J-1)  +  A(I-1,  J+k) 

where  k  is  a  relatively  large  number,  then  the  latter  method  will  become 
very  inefficient. 
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Hence,  we  can  conclude  that,  to  design  an  efficient  system 
for  executing  program  loops,  a  general  compiler  simulator  can  be 
established  to  distribute  loops  according  to  different  machine  con- 
figurations and  various  parallelism  optimization  levels.   Each 
distributed  loop  can  be  assessed  with  computation  time  by  the  time 
bounds  described  in  this  section,  or  more  practically  by  a  general 
machine  simulator  to  obtain  the  overall  speedup  of  the  original  loop. 
Experiments  of  this  kind  on  a  large  sample  of  ordinary  programs  will 
indicate  which  options  will  influence  average  speedup  (or  efficiency) 
in  major  ways.   As  a  result  of  such  studies,  a  practical  loop  distribu- 
tion algorithm  of  compiler  with  reasonable  complexity  can  emerge  for  a 
given  machine;  and,  in  general,  an  efficient  structure  of  the  machine 
for  ordinary  programs  will  also  be  revealed. 
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6.   CONCLUSION 

In  this  thesis,  we  have  shown  efficient  algorithms  for  parallel 
evaluation  of  general  linear  recurrence  systems  and  m-th  order  linear 
recurrence  systems,  with  arbitrary  or  constant  coefficients.   Detailed 
results  have  been  given  in  the  introduction  of  each  chapter  and  need  not 
be  repeated  here. 

These  algorithms  could  be  used  directly  as  a  building  block  in 
forming  parallel  algorithms  for  standard  numerical  programs  such  as  in 
[28  3,  [29].   We  can  also  use  them  in  compilers.   With  the  loop  distribution 
algorithm  discussed  here,  a  compiler  should  be  able  to  decompose  a  given 
loop  into  loops  which  can  be  computed  as  vector  operations  or  recurrence 
relations.   The  latter  case  can  be  further  transformed  into  vector 
operations  (or  matrix  operations  if  the  machine  structure  allows)  by  our 
algorithms  without  programmer  intervention. 

All  algorithms  described  in  this  thesis  are  primarily  for 
conceptual  understanding.   A  properly  designed  machine  organization  plus 
some  practical  modifications  of  the  basic  algorithms  presented  here  for 
that  particular  system  should  provide  us  high  efficiency  computations  of 
iterative  programs. 

We  conclude  this  thesis  by  giving  the  following  several  topics 
which  are  opened  up  by  this  research  and  deserve  further  studies: 

1)   Is  there  any  other  algorithm  which  can  be  used  to  evaluate 

general  linear  recurrence  systems  in  O(log„n)  time  steps  instead  of 

2  2 

O(log-n)  steps?   Since  only  0(n  )  data  elements  are  involved  in  matrix  A 

and  c,  it  follows  from  a  fan-in  argument  that  one  might  hope  for  a  faster 
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scheme  requiring  only  logarithmetic  time.   Otherwise,  could  we  disprove 
this  possibility  by  some  abstract  combinatorial  argument?   Similar 
questions  can  be  raised  for  m-th  order  linear  recurrence  systems. 

2)  Transformation  of  various  standard  numerical  algorithms 
into  algorithms  which  use  linear  recurrences,  such  as  the  ones  mentioned 
above  and  recently  in  [30],  could  be  explored  for  future  parallel 
algorithms. 

3)  Practical  compiler  algorithms  for  decomposition  of  loops 
with  jump/logic  statements  and  various  indexing  patterns  should  be 
studied  in  detail. 

4)  Numerical  stability  problems  of  these  basic  algorithms 
require  more  understanding,  although  it  is  expected  that  the  error 
bounds  will  be  within  the  range  of  serial  computations. 
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